diff --git a/CHANGELOG.md b/CHANGELOG.md index 8bd1630..0cd1d30 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,86 @@ The Python import name is `codegraph`; the PyPI package and CLI are `cgh`. ## [Unreleased] +## [0.5.0] - 2026-06-08 + +A large feature release built on a full code audit (security, correctness, +readability, and roadmap). The MCP server now exposes 47 tools, there is a +new CI-oriented CLI command, broader language and framework coverage, and two +optional extras. Everything is additive and backwards compatible; the new +extras are opt-in and defaults are unchanged. + +### Added +- **Code-intelligence MCP tools**: `file_summary` (one-shot file orientation), + `impact_of` (reverse blast radius), `path_between` (shortest call/import + path), `import_cycles` (SCC cycle detection), `tests_for` / `untested` + (test-to-code mapping inferred from imports/calls + roles), `hotspots` + (git churn x import centrality), and `who_knows` (file ownership from git). +- **`role` / `layer` filters** on `search_symbols` and `symbol_lookup`. +- **`cgh impact --since `**: a non-MCP CLI command for CI and PR bots that + reports changed symbols, blast radius grouped by role/layer, endpoints + touched, and tests to run, as a markdown summary or JSON. Reads the graph + read-only, so no server needs to be running. +- **`cgh graph layers`**: a layer-to-layer dependency diagram (Mermaid/Graphviz). +- **Config-as-data parsers** for JSON / JSONC, YAML, and TOML (top-level keys + become navigable sections: CI jobs, k8s kinds, compose services, + package.json scripts, pyproject tables), and a **SQL DDL parser** that turns + `CREATE TABLE` / `ALTER TABLE` into table sections with columns. +- **More endpoint frameworks**: Django urls, NestJS, Spring, and Gin/Echo, on + top of the existing FastAPI / Flask / Nuxt / Express. 
+- **Optional `langs` extra** (`pip install "cgh[langs]"`): C# and Ruby + tree-sitter parsers, kept optional so the core install stays lean and + Python-3.14-safe. +- **Optional `lsp` extra** (`pip install "cgh[lsp]"`): opt-in precise + cross-file CALLS resolution for Python via jedi, behind a `precise_calls` + config flag (or `CGH_PRECISE_CALLS`). +- **Walk-up root discovery**: `cgh` now resolves the nearest ancestor + `.codegraph/` from any subdirectory, the way git finds its repo root, so the + commands work from anywhere inside an initialized project. + +### Fixed +- **DuckDB / Kuzu parity**: `purge_file_data` now also removes the inbound side + of self-referential edges (CALLS, INHERITS) on DuckDB, so `find_callers` no + longer returns ghost callers after a symbol changes. +- **CALLS resolution** prefers a same-file definition before falling back to + repo-wide name matching, cutting spurious cross-file edges, and memoizes + lookups per file. +- The indexer now **honors `max_file_size_kb` and `ignore_patterns`** (they + were defined and documented but never enforced). +- **Federated subrepos are skipped on Windows.** `is_under_any` left an + absolute candidate path unresolved and compared case-sensitively, so on the + case-insensitive Windows filesystem every federated subrepo missed the skip + list and the parent scanned the whole tree. Paths are now resolved and + case-normalized on both sides. +- Module-level FTS and `.cghignore` caches are keyed by repo root, so a + multi-repo process no longer crosses streams. +- `cgh status` shows `would create graph.duckdb` (not the Kuzu file) and + `Endpoints: unknown` instead of a bare comma when the graph is unreadable. +- Markdown links resolve relative to the file that contains them. 
+- Barrel re-exports cap their per-import symbol edges; the git-diff discovery + timeout matches `git ls-files`; `find` prunes ignore dirs at the walk level; + and several silently-swallowed failures (connection close, query iteration, + scan deletions) are now surfaced. + +### Changed +- The parent + children federation fan-out is now a single shared helper + (`federate_scoped` / `federate_flat`); the server modules use the canonical + `_graphdb` names instead of the deprecated `_kuzu` aliases. +- `cmd_init` and `cmd_status` were decomposed into named phase helpers, the + repeated `--root` argparse boilerplate was factored out, and CLI handlers + are typed; `cmd_status`'s owner/RO/FTS fallback ladder gained tests. + +### Security +- The owner's bearer-token check is now constant-time (`hmac.compare_digest`). +- Removed the dead `.mcp.json` auth env-injection path: the `0600` + `.codegraph/auth.key` file is the shared secret, and `.codegraph/` is created + `0700`. Corrected the auth documentation to match. +- `index_changed_files` rejects a `since` ref beginning with `-`, and + `pattern_search` passes the user pattern after `--` (ripgrep) / via `-e` + (git-grep), closing argument-injection vectors that could reach ripgrep's + preprocessor. +- `force_index` refuses absolute paths that resolve outside the repo. +- The generated HTML diagram pins the Mermaid CDN script with an SRI hash. + ## [0.4.6] - 2026-06-06 A cross-platform audit pass. Five parallel reviews of signals, paths, file @@ -194,7 +274,8 @@ Highlights from this line: First tagged release on PyPI. 
-[Unreleased]: https://github.com/altikva/cgh/compare/v0.4.6...HEAD +[Unreleased]: https://github.com/altikva/cgh/compare/v0.5.0...HEAD +[0.5.0]: https://github.com/altikva/cgh/compare/v0.4.6...v0.5.0 [0.4.6]: https://github.com/altikva/cgh/compare/v0.4.5...v0.4.6 [0.4.5]: https://github.com/altikva/cgh/compare/v0.4.4...v0.4.5 [0.4.4]: https://github.com/altikva/cgh/compare/v0.4.3...v0.4.4 diff --git a/README.md b/README.md index 69cb1fb..41bcd73 100644 --- a/README.md +++ b/README.md @@ -54,6 +54,19 @@ cd cgh uv pip install -e . # or: pip install -e . ``` +Optional extras (none are required; the core install is lean and works on Python 3.11 through 3.14): + +```bash +pip install "cgh[langs]" # C# and Ruby parsers (tree-sitter grammars) +pip install "cgh[lsp]" # precise cross-file Python call resolution (jedi) +pip install "cgh[kuzu]" # the legacy Kuzu graph backend (DuckDB is the default) + +# Combine extras in one bracket, comma-separated: +pip install "cgh[langs,lsp]" +``` + +Quote the package spec (`"cgh[...]"`) so zsh and bash do not try to glob the brackets. The same form works with `pipx install`, `uv tool install`, `uv pip install`, and from a source checkout: `pip install -e ".[langs,lsp]"`. + ```bash cgh --version cgh init # initialize in any project @@ -134,101 +147,6 @@ Then reads only lines 42-55. --- -## Architecture (v0.4) - -``` -codegraph/ - __init__.py # version - __main__.py # thin argparse + dispatch - indexer.py # parse + graph ingestion engine (the only - # top-level .py beside the entrypoints) - - core/ # shared infrastructure - protocol.py # GraphDB / QueryResult Protocols (backend boundary) - db.py # backend selection + cached conns - db_duckdb.py # DuckDB adapter (default backend since v0.4) - db_kuzu.py # Kuzu adapter (opt-in via CGH_DB=kuzu) - graph_model.py # NODES / EDGES schema dicts shared by both backends - schema.py # Kuzu DDL - schema_duckdb.py # DuckDB DDL - utils.py # rows(), short_path(), normalize_identifier(), ... 
- config.py # layered TOML config - fts.py # BM25 full-text search (SQLite FTS5) - - parsers/ # plugin registry (auto-discovery) - base.py # BaseParser ABC + FileIndex dataclass - builtins.py # per-language built-in callables (filter) - python.py, typescript.py, vue.py - golang.py, rust.py, java.py - terraform.py, markdown.py, plaintext.py - - imports/ # JS / TS import resolution - resolver.py # entry point: resolves any import to a Path - tsconfig.py # tsconfig.json paths + extends + JSONC - workspaces.py # npm / pnpm / yarn / lerna workspace packages - - state/ # per-repo runtime state - auth.py # MCP auth key - activity.py # activity log (parse_error, scan events) - call_log.py # per-MCP-tool call log + stats - scan_meta.py # last-indexed git SHA + freshness - watcher.py # watchdog-based live file watcher - ipc.py # stdio <-> HTTP bridge for cgh serve - pidfile.py # single-writer lock for the owner process - - analysis/ # higher-level graph analysis - context_builder.py # AI context builder (graph + FTS) - dead_code.py # unused symbol detection - federation.py # multi-repo federation (parent reads child DBs) - endpoints.py # HTTP endpoint extraction (FastAPI / Nuxt) - module_doc.py # one-line module summary extraction - roles.py # file role classification by path conventions - pattern.py # regex / substring search across indexed files - - claude_state/ # indexing of ~/.claude/* state - memory.py # memory_search / memory_list FTS index - plans.py # plan_search / plan_list FTS index - - server/ # MCP server - __init__.py # FastMCP setup + owner process main() - tools_query.py # symbol_lookup, callers, callees, imports, subgraph - tools_docs.py # search_docs, doc_outline, doc_refs - tools_index.py # scan_repo, index_changed, force_index - tools_viz.py # visualize_graph, graph_stats - tools_arch.py # arch / role / endpoint tools - tools_meta.py # fts_search, dead_code, context_for_task - - cli/ # Rich CLI - commands_init.py # init, setup, parsers - commands_index.py 
# index, watch, serve, force-index - commands_monitor.py # stats, status, logs, history, diff, doctor - commands_query.py # search, lookup, callers, callees, outline - commands_graph.py # graph, add-dir - commands_hooks.py # _hook_precheck_grep / _hook_precheck_read - commands_federate.py # federate add / list / verify - - integrations/ # per-AI-tool integration glue - skill_installer.py # install bundled skills to .claude/, .cursor/, ... - post_commit.py # Claude Code post-commit hook handler - claude-code.md, cursor.md, codex.md, gemini.md - - viz/ # diagram rendering - mermaid.py # Mermaid diagram generators - html.py # HTML template + browser open - - skills/ # bundled Claude Code skills shipped with the package - -tests/ # 235+ tests (pytest) - test_parsers/ # Python, TS, Go, Rust, Java, tsconfig, workspaces - test_core/ # db, utils, normalize_identifier - test_indexer/ # engine, builtins, safe_extract, imports_edges - test_search/ # FTS - test_server/ # MCP server tools, federation - test_memory_plans/ # memory + plan indexing -``` - ---- - ## CLI Reference All commands accept `--root ` to target a different project. The CLI is `cgh`; the Python import is `codegraph` (see Install above). 
@@ -443,7 +361,7 @@ cgh graph imports --html out.html # save to file instead of opening browser cgh graph overview --max-nodes 20 # limit nodes ``` -Scopes: `overview`, `imports`, `calls`, `classes`, `docs` +Scopes: `overview`, `imports`, `calls`, `classes`, `docs`, `layers` (layer-to-layer dependency diagram) ```text +--------------------------------------------+ @@ -456,6 +374,20 @@ Scopes: `overview`, `imports`, `calls`, `classes`, `docs` +--------------------------------------------+ ``` +`cgh graph layers --mermaid` prints the layer-dependency diagram, which renders inline on GitHub: + +```mermaid +graph TD + presentation["presentation"]:::layer + application["application"]:::layer + test["test"]:::layer + other["other"]:::layer + presentation -->|2| other + application -->|1| other + test -->|2| other + classDef layer fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px +``` + ### Monitor #### `stats` @@ -572,6 +504,37 @@ cgh diff --since main +----------------------------------------------+ ``` +#### `impact` + +Report the blast radius of changes since a git ref, for CI and PR bots. Diffs the changed files, then reports the symbols they define, what transitively imports them (grouped by role/layer), the endpoints touched, and the tests to run. Reads the graph read-only, so no MCP server needs to be running; keep the index fresh with `cgh index` in CI. 
+ +```bash +cgh impact --since HEAD~1 # human-readable summary +cgh impact --since main --format md # markdown for a PR comment +cgh impact --since main --json # machine-parseable on clean stdout +``` + +```text +## cgh impact (since `HEAD~1`) + +**Changed files (1)** +- `api/models/donation.py` + +**Impacted files (3)** +- application (1) + - `api/handlers/receipt_handler.py` `handler` +- presentation (1) + - `api/routers/donations.py` `router` +- test (1) + - `tests/test_receipt.py` `test` + +**Endpoints touched (1)** +- `POST /donations` (api/routers/donations.py) + +**Tests to run (1)** +- `tests/test_receipt.py` +``` + #### `parsers` List all registered language parsers. @@ -775,7 +738,7 @@ Owners are independent: the parent reads child DBs directly as files, it does NO ## MCP Tools -When running as an MCP server (`cgh serve`), codegraph exposes 39 tools. +When running as an MCP server (`cgh serve`), codegraph exposes 47 tools. ### Architecture Awareness (call these FIRST) @@ -783,20 +746,33 @@ When running as an MCP server (`cgh serve`), codegraph exposes 39 tools. 
|------|-------------| | `architecture_overview(max_files_per_role?)` | Compact map of all files grouped by layer (presentation/application/domain/infra/test/doc) and role (handler/router/component/store/…) with 1-line summaries: no Read needed | | `domain_map(keyword, limit_per_role?)` | Every file whose path / role / module_doc mentions the keyword, grouped by role | -| `endpoints(path_pattern?, method?)` | List HTTP endpoints (FastAPI decorators + Nuxt server/api file routes + Express) with their handlers: works cross-repo when `extra_dirs` is configured | +| `endpoints(path_pattern?, method?)` | List HTTP endpoints (FastAPI, Flask, Nuxt, Express, Django urls, NestJS, Spring, Gin/Echo) with their handlers: works cross-repo when `extra_dirs` is configured | ### Code Navigation | Tool | Description | |------|-------------| -| `symbol_lookup(name)` | Find where a function, class, TF resource, or doc section is defined | +| `symbol_lookup(name, role?, layer?)` | Find where a function, class, TF resource, or doc section is defined; optional `role` / `layer` filters | | `find_callers(fn_name)` | Find all functions that call `fn_name` | | `find_callees(fn_name)` | Find all functions that `fn_name` calls | | `imports_of(file_path)` | List modules imported by a file | -| `search_symbols(query, limit?)` | Fuzzy search across all symbol types | +| `search_symbols(query, limit?, role?, layer?)` | Fuzzy search across all symbol types; optional `role` / `layer` filters | | `subgraph(file_path, depth?)` | Find files related within N import hops (blast radius) | | `graph_stats()` | Node and edge counts per type | +### Code Intelligence + +| Tool | Description | +|------|-------------| +| `file_summary(file_path)` | One-shot orientation for a file: role/layer/lang, its functions and classes with line ranges, what it imports, and who imports it | +| `impact_of(symbol_or_file, max_depth?)` | Reverse blast radius: everything that transitively calls or imports the target, grouped by 
role/layer, with reaching endpoints | +| `path_between(src, dst, edge?)` | Shortest path between two symbols/files over `CALLS` or `IMPORTS` | +| `import_cycles(limit?)` | Detect import cycles (strongly-connected components) in the file import graph | +| `tests_for(symbol_or_file)` | Test files that exercise the target (inferred from imports/calls + role, not coverage) | +| `untested(role?, layer?)` | Source files that no test file imports | +| `hotspots(limit?)` | Change-risk ranking: git churn x import centrality x recency | +| `who_knows(file_path)` | Top authors of a file by commit count and recency (from git history) | + ### Documentation | Tool | Description | @@ -861,6 +837,12 @@ codegraph supports any language through a plugin system. Adding a new language r | Java | tree-sitter | `.java` | classes, interfaces, methods, constructors, imports, calls | | Terraform | regex + brace tracker | `.tf` | resources, variables, outputs, depends_on | | Markdown | regex | `.md` `.mdx` | headings, internal links, code symbol references | +| Config data | stdlib + PyYAML | `.json` `.yaml` `.yml` `.toml` | top-level keys as sections (CI jobs, k8s kinds, compose services, package.json scripts, pyproject tables) | +| SQL | regex | `.sql` | `CREATE TABLE` / `ALTER TABLE` as table sections with columns | +| C# (optional) | tree-sitter | `.cs` | classes, interfaces, structs, enums, records, methods, usings, calls | +| Ruby (optional) | tree-sitter | `.rb` | classes, modules, methods, requires, calls | + +C# and Ruby ship in the optional `langs` extra (`pip install "cgh[langs]"`) so the core install stays lean and Python-3.14-safe. When the extra is absent, those file types are simply skipped. 
### Adding a New Language @@ -912,6 +894,7 @@ ignore_dirs = [".git", "node_modules", "__pycache__", ".venv"] ignore_patterns = ["*.min.js", "*.bundle.js"] max_file_size_kb = 500 extra_dirs = ["../frontend"] +# precise_calls = true # resolve Python calls cross-file via jedi (needs cgh[lsp]) [parsers] # enabled = ["python", "typescript", "markdown"] @@ -928,7 +911,8 @@ reindex_on_start = true |----------|-------------| | `CODEGRAPH_ROOT` | Override project root | | `CODEGRAPH_DIR` | Override `.codegraph/` location | -| `CODEGRAPH_AUTH_KEY` | MCP server auth key (auto-generated by `cgh init`, injected into `.mcp.json`) | +| `CGH_DB` | Graph backend: `duckdb` (default) or `kuzu` | +| `CGH_PRECISE_CALLS` | `1` to resolve Python calls cross-file via jedi (needs `cgh[lsp]`) | ### `.cghignore` @@ -1032,23 +1016,20 @@ MdSection --MD_REFS_CLASS-----> Class (code references in docs) ### MCP Auth Key -`cgh init` generates a cryptographic auth key at `.codegraph/auth.key` (auto-added to `.gitignore`). The key is injected into `.mcp.json` as the `CODEGRAPH_AUTH_KEY` environment variable. - -This is defense-in-depth for when codegraph moves to HTTP transport. Over stdio, the key provides process-level authentication. +`cgh init` generates a cryptographic auth key at `.codegraph/auth.key` (auto-added to `.gitignore`). The owner process and every worker / CLI caller read that file and send it as a `Bearer` token to the owner's loopback HTTP bridge, which compares it in constant time. The file contents are the shared secret: there is no environment-variable hand-off. ```bash # Key is auto-managed -- no manual steps needed -cgh init # generates key + injects into .mcp.json -cgh setup claude # injects key into .mcp.json for Claude Code +cgh init # generates the key and the .codegraph/ index dir ``` -The key file has `600` permissions (owner-only read/write). Never commit it to git. +The key file has `600` permissions and the `.codegraph/` directory is `700` (owner-only). 
Never commit either to git. --- ## Limitations -- **CALLS resolution is name-based.** If two functions share a name, both get edges. Fully qualified resolution would need type inference, which is out of scope. +- **CALLS resolution is name-based by default.** A call is linked to a same-file function of that name, falling back to all repo functions with that name only when there is no same-file match, so cross-file call edges are best-effort. For Python you can opt into precise cross-file resolution with `pip install "cgh[lsp]"` and `precise_calls = true` (jedi-backed); other languages stay name-based. - **Terraform HCL uses regex, not a full grammar.** Complex meta-arguments may be missed. - **JS/TS imports resolve to local files only.** Relative imports (`import x from "./utils"`), tsconfig `paths` aliases, and workspace packages do create a `File -> File` IMPORTS edge. Bare external packages (`import react`) are not resolved to a node, and cross-repo edges are not inferred (each federated scope is canonical for its own files). - **Markdown code refs are heuristic.** PascalCase and snake_case patterns are matched, so a ref can be a false positive. 
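
The name-based CALLS rule in the limitations above (prefer a same-file definition, otherwise fall back to every repo function with that name) can be sketched roughly as follows. This is a minimal illustration only, not the indexer's actual code: `FunctionDef`, `build_name_index`, and `resolve_call` are hypothetical names introduced here, and the real implementation may differ in data model and in how it memoizes lookups.

```python
from collections import defaultdict
from typing import NamedTuple


class FunctionDef(NamedTuple):
    """Hypothetical record: a function name and the file that defines it."""
    name: str
    file: str


def build_name_index(defs: list[FunctionDef]) -> dict[str, list[FunctionDef]]:
    """Group definitions by bare name so each lookup is a single dict access."""
    index: dict[str, list[FunctionDef]] = defaultdict(list)
    for d in defs:
        index[d.name].append(d)
    return dict(index)


def resolve_call(
    call_name: str,
    caller_file: str,
    index: dict[str, list[FunctionDef]],
) -> list[FunctionDef]:
    """Return the definitions a call would be linked to.

    Same-file matches win; only when there are none does the lookup fall
    back to every definition with that name anywhere in the repo.
    """
    candidates = index.get(call_name, [])
    same_file = [d for d in candidates if d.file == caller_file]
    return same_file or candidates


defs = [
    FunctionDef("save", "a.py"),
    FunctionDef("save", "b.py"),
    FunctionDef("load", "b.py"),
]
index = build_name_index(defs)
local_hit = resolve_call("save", "a.py", index)   # same-file match only
fallback = resolve_call("save", "c.py", index)    # no local match: both defs
```

The fallback case is where spurious cross-file edges come from: a call to `save` from `c.py` is linked to both definitions, which is why those edges are described as best-effort.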
diff --git a/codegraph/__main__.py b/codegraph/__main__.py index 17afb29..a14fd46 100644 --- a/codegraph/__main__.py +++ b/codegraph/__main__.py @@ -13,6 +13,7 @@ import argparse import os import sys +from pathlib import Path from rich.panel import Panel from rich.table import Table @@ -23,6 +24,7 @@ from codegraph.cli.commands_graph import cmd_add_dir, cmd_graph, register_graph_parser from codegraph.cli.commands_ensurepath import cmd_ensurepath from codegraph.cli.commands_githooks import cmd_githooks +from codegraph.cli.commands_impact import cmd_impact from codegraph.cli.commands_index import ( cmd_force_index, cmd_index, @@ -60,7 +62,7 @@ ) -def _cmd_serve_owner(args) -> None: +def _cmd_serve_owner(args: argparse.Namespace) -> None: """Internal: run the HTTP-backed owner process (spawned by cgh serve).""" from codegraph.server import owner_main @@ -75,7 +77,9 @@ def _cmd_serve_owner(args) -> None: def _print_help(): """Print a beautiful help screen when no command is given.""" console.print(LOGO) - console.print(f" [dim]v{VERSION}[/dim] [dim]---[/dim] Local code graph index for AI coding assistants\n") + console.print( + f" [dim]v{VERSION}[/dim] [dim]---[/dim] Local code graph index for AI coding assistants\n" + ) sections = [ ( @@ -95,7 +99,10 @@ def _print_help(): ("callers", "Who calls this function? (tree view)"), ("callees", "What does this function call? 
(tree view)"), ("outline", "Heading tree of a Markdown file"), - ("graph", "Visualize the graph in browser (imports/calls/classes/docs)"), + ( + "graph", + "Visualize the graph in browser (imports/calls/classes/docs)", + ), ], ), ( @@ -106,17 +113,21 @@ def _print_help(): ("logs", "View MCP tool call history"), ("history", "Recent indexing activity grouped by day"), ("diff", "Files changed since last index"), + ("impact", "CI: blast radius + tests for a PR diff (JSON/md)"), ("parsers", "List registered language parsers"), ], ), ( "Maintenance", [ - ("doctor", "Health check --- verify all components are working"), + ("doctor", "Health check: verify all components are working"), ("compact", "Vacuum SQLite DBs and reclaim space"), ("hooks", "Install git hooks that reindex after pull/merge/checkout"), ("ensurepath", "Add the cgh command to your PATH"), - ("migrate-to-duckdb", "Re-index Kuzu repos onto DuckDB (faster + smaller)"), + ( + "migrate-to-duckdb", + "Re-index Kuzu repos onto DuckDB (faster + smaller)", + ), ], ), ( @@ -124,8 +135,14 @@ def _print_help(): [ ("watch", "Index + live-watch for file changes"), ("add-dir", "Manage extra directories in the graph"), - ("federate", "Federate sub-repos (parent queries their indexes read-only)"), - ("force-index", "Index files bypassing .gitignore (requires confirmation)"), + ( + "federate", + "Federate sub-repos (parent queries their indexes read-only)", + ), + ( + "force-index", + "Index files bypassing .gitignore (requires confirmation)", + ), ], ), ] @@ -146,7 +163,9 @@ def _print_help(): console.print(table) console.print() - console.print(" [bold]Usage:[/bold] cgh [cyan][/cyan] [dim][options][/dim]") + console.print( + " [bold]Usage:[/bold] cgh [cyan][/cyan] [dim][options][/dim]" + ) console.print(" [bold]Help:[/bold] cgh [cyan][/cyan] --help") console.print() @@ -195,32 +214,47 @@ class _LogoArgumentParser(argparse.ArgumentParser): def error(self, message: str) -> None: # type: ignore[override] console.print(LOGO) 
console.print(f"[red]error:[/red] {message}\n") - console.print("[dim]Run[/dim] [cyan]cgh --help[/cyan] [dim]for the full list of commands.[/dim]") + console.print( + "[dim]Run[/dim] [cyan]cgh --help[/cyan] [dim]for the full list of commands.[/dim]" + ) sys.exit(2) + def _add_root(p) -> None: + """Attach the standard --root flag (default: cwd). Every subcommand + takes one; main() then resolves it up to the nearest .codegraph/.""" + p.add_argument("--root", default=os.getcwd()) + ap = _LogoArgumentParser(prog="codegraph", add_help=False) - ap.add_argument("--root", default=os.getcwd()) + _add_root(ap) ap.add_argument("--version", action="store_true") ap.add_argument("-h", "--help", action="store_true") sub = ap.add_subparsers(dest="cmd", parser_class=_LogoArgumentParser) # --- init --- - p = sub.add_parser("init", help="Initialize codegraph in current directory (interactive wizard)") - p.add_argument("--root", default=os.getcwd()) - p.add_argument("--yes", "-y", action="store_true", help="Accept all defaults (non-interactive)") + p = sub.add_parser( + "init", help="Initialize codegraph in current directory (interactive wizard)" + ) + _add_root(p) + p.add_argument( + "--yes", "-y", action="store_true", help="Accept all defaults (non-interactive)" + ) # --- parsers --- sub.add_parser("parsers", help="List registered parsers and supported languages") # --- setup --- p = sub.add_parser("setup", help="Generate integration files for AI tools") - p.add_argument("target", choices=["claude", "cursor", "codex", "gemini", "all"], help="Which AI tool to configure") - p.add_argument("--root", default=os.getcwd()) + p.add_argument( + "target", + choices=["claude", "cursor", "codex", "gemini", "all"], + help="Which AI tool to configure", + ) + _add_root(p) # --- index --- p = sub.add_parser("index", help="Full index / re-index the repository") p.add_argument("--verbose", "-v", action="store_true") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) p.add_argument( 
"--method", "-m", @@ -245,11 +279,13 @@ def error(self, message: str) -> None: # type: ignore[override] # --- watch --- p = sub.add_parser("watch", help="Index then watch for file changes") p.add_argument("--verbose", "-v", action="store_true") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) # --- serve --- - p = sub.add_parser("serve", help="Start MCP server (stdio proxy to shared HTTP owner)") - p.add_argument("--root", default=os.getcwd()) + p = sub.add_parser( + "serve", help="Start MCP server (stdio proxy to shared HTTP owner)" + ) + _add_root(p) p.add_argument("--watch", action="store_true", help="Enable live file watcher") p.add_argument("--reindex", action="store_true", help="Re-index before serving") p.add_argument( @@ -262,7 +298,7 @@ def error(self, message: str) -> None: # type: ignore[override] # --- _serve_owner (hidden internal subcommand) --- p = sub.add_parser("_serve_owner", help=argparse.SUPPRESS) - p.add_argument("--root", default=os.getcwd()) + _add_root(p) p.add_argument("--watch", action="store_true") p.add_argument("--reindex", action="store_true") @@ -273,44 +309,75 @@ def error(self, message: str) -> None: # type: ignore[override] # --- stats --- p = sub.add_parser("stats", help="Show graph, edges, call stats, storage") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) p.add_argument("--json", action="store_true", help="Output as JSON") - p.add_argument("--live", action="store_true", help="Refresh stats every 500ms (Ctrl-C to stop)") + p.add_argument( + "--live", action="store_true", help="Refresh stats every 500ms (Ctrl-C to stop)" + ) p = sub.add_parser("tail", help="Live view of scan/watcher activity") - p.add_argument("--root", default=os.getcwd()) - p.add_argument("--limit", "-n", type=int, default=30, help="Number of recent entries (default: 30)") - p.add_argument("--follow", "-f", action="store_true", help="Follow new activity (Ctrl-C to stop)") + _add_root(p) + p.add_argument( + "--limit", + "-n", + 
type=int, + default=30, + help="Number of recent entries (default: 30)", + ) + p.add_argument( + "--follow", + "-f", + action="store_true", + help="Follow new activity (Ctrl-C to stop)", + ) - p = sub.add_parser("reset", help="Nuke graph + FTS DBs, kill owner, re-index from scratch") - p.add_argument("--root", default=os.getcwd()) + p = sub.add_parser( + "reset", help="Nuke graph + FTS DBs, kill owner, re-index from scratch" + ) + _add_root(p) p.add_argument("--yes", "-y", action="store_true", help="Skip confirmation") - p.add_argument("--drop-extra-dirs", action="store_true", help="Also remove extra_dirs from config.toml") - p.add_argument("--no-reindex", action="store_true", help="Don't re-index after cleaning") + p.add_argument( + "--drop-extra-dirs", + action="store_true", + help="Also remove extra_dirs from config.toml", + ) + p.add_argument( + "--no-reindex", action="store_true", help="Don't re-index after cleaning" + ) # --- migrate-to-duckdb --- p = sub.add_parser( "migrate-to-duckdb", help="Re-index a Kuzu-backed repo into DuckDB, verify counts match, optionally delete graph.db", ) - p.add_argument("--root", default=os.getcwd()) + _add_root(p) p.add_argument( - "--yes", "-y", action="store_true", + "--yes", + "-y", + action="store_true", help="Skip the 'delete graph.db?' 
confirmation", ) p.add_argument( - "--keep-kuzu", action="store_true", + "--keep-kuzu", + action="store_true", help="Always keep graph.db even after a clean migration", ) p.add_argument( - "--force", action="store_true", + "--force", + action="store_true", help="Overwrite an existing graph.duckdb (default: abort if present)", ) - p = sub.add_parser("status", help="Owner state, scan freshness, counts, extra_dirs in one glance") - p.add_argument("--root", default=os.getcwd()) + p = sub.add_parser( + "status", help="Owner state, scan freshness, counts, extra_dirs in one glance" + ) + _add_root(p) p.add_argument("--json", action="store_true", help="Output as JSON") - p.add_argument("--workers", action="store_true", help="Also list every worker pid + tty + start time") + p.add_argument( + "--workers", + action="store_true", + help="Also list every worker pid + tty + start time", + ) p.add_argument( "--refresh", action="store_true", @@ -321,83 +388,131 @@ def error(self, message: str) -> None: # type: ignore[override] ), ) - p = sub.add_parser("memory-index", help="Scan the Claude Code memory directory into the FTS index") - p.add_argument("--root", default=os.getcwd()) + p = sub.add_parser( + "memory-index", help="Scan the Claude Code memory directory into the FTS index" + ) + _add_root(p) p.add_argument("--verbose", "-v", action="store_true") p = sub.add_parser("plan-index", help="Scan ~/.claude/plans/ into the FTS index") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) p.add_argument("--verbose", "-v", action="store_true") # --- logs --- p = sub.add_parser("logs", help="View MCP tool call logs") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) p.add_argument("--tool", "-t", help="Filter by tool name") p.add_argument("--errors", "-e", action="store_true", help="Show only errors") - p.add_argument("--limit", "-n", type=int, default=50, help="Max entries (default: 50)") + p.add_argument( + "--limit", "-n", type=int, default=50, help="Max 
entries (default: 50)" + ) p.add_argument("--json", action="store_true", help="Output as JSON") p.add_argument("--clear", action="store_true", help="Clear all logs") # --- search --- - p = sub.add_parser("grep", help="Regex/substring search across indexed files (ripgrep under the hood)") + p = sub.add_parser( + "grep", + help="Regex/substring search across indexed files (ripgrep under the hood)", + ) p.add_argument("pattern", help="regex (default) or literal (with --fixed)") p.add_argument("--glob", "-g", default="", help="shell glob filter, e.g. '*.py'") p.add_argument("--limit", "-n", type=int, default=50) - p.add_argument("--fixed", "-F", action="store_true", help="literal substring, not regex") + p.add_argument( + "--fixed", "-F", action="store_true", help="literal substring, not regex" + ) p.add_argument("--case", "-s", action="store_true", help="case-sensitive match") p.add_argument("--json", action="store_true") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) p = sub.add_parser("search", help="Search symbols by name (fuzzy)") p.add_argument("query", help="Search query") - p.add_argument("--root", default=os.getcwd()) - p.add_argument("--limit", "-n", type=int, default=100, help="Page size (default: 100)") - p.add_argument("--offset", "-o", type=int, default=0, help="Skip first N results (for pagination)") + _add_root(p) + p.add_argument( + "--limit", "-n", type=int, default=100, help="Page size (default: 100)" + ) + p.add_argument( + "--offset", + "-o", + type=int, + default=0, + help="Skip first N results (for pagination)", + ) p.add_argument("--json", action="store_true") # --- lookup --- p = sub.add_parser("lookup", help="Find where a symbol is defined") p.add_argument("name", help="Symbol name") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) # --- callers --- p = sub.add_parser("callers", help="Find all callers of a function (tree view)") p.add_argument("fn_name", help="Function name") - p.add_argument("--root", 
default=os.getcwd()) + _add_root(p) # --- callees --- - p = sub.add_parser("callees", help="Find all functions called by a function (tree view)") + p = sub.add_parser( + "callees", help="Find all functions called by a function (tree view)" + ) p.add_argument("fn_name", help="Function name") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) # --- outline --- p = sub.add_parser("outline", help="Show heading outline of a Markdown file (tree)") p.add_argument("file", help="Markdown file path") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) # --- doctor --- - p = sub.add_parser("doctor", help="Health check --- verify all codegraph components") - p.add_argument("--root", default=os.getcwd()) + p = sub.add_parser("doctor", help="Health check: verify all codegraph components") + _add_root(p) # --- diff --- p = sub.add_parser("diff", help="Show files changed since last index") - p.add_argument("--root", default=os.getcwd()) - p.add_argument("--since", default="HEAD", help="Git ref to diff against (default: HEAD)") + _add_root(p) + p.add_argument( + "--since", default="HEAD", help="Git ref to diff against (default: HEAD)" + ) + + # --- impact (CI mode: blast radius + tests for a PR diff) --- + p = sub.add_parser( + "impact", + help="CI: blast radius + tests for files changed since a git ref", + ) + _add_root(p) + p.add_argument( + "--since", + default="HEAD~1", + help="Git ref to diff the working tree against (default: HEAD~1)", + ) + p.add_argument( + "--json", action="store_true", help="Emit JSON (shorthand for --format json)" + ) + p.add_argument( + "--format", + choices=["md", "json"], + default="md", + help="Output format: md (PR comment) or json (default: md). 
" + "The graph index should be fresh: run `cgh index` first in CI.", + ) # --- history --- p = sub.add_parser("history", help="Show recent indexing activity by day") - p.add_argument("--root", default=os.getcwd()) - p.add_argument("--days", "-d", type=int, default=7, help="Number of days to show (default: 7)") + _add_root(p) + p.add_argument( + "--days", "-d", type=int, default=7, help="Number of days to show (default: 7)" + ) # --- compact --- p = sub.add_parser("compact", help="Vacuum SQLite DBs and show before/after sizes") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) # --- graph + add-dir --- register_graph_parser(sub) # --- federate --- - p = sub.add_parser("federate", help="Manage federated subrepos (parent queries their indexes read-only)") + p = sub.add_parser( + "federate", + help="Manage federated subrepos (parent queries their indexes read-only)", + ) p.add_argument( "action", nargs="?", @@ -406,17 +521,22 @@ def error(self, message: str) -> None: # type: ignore[override] help="Action (default: list)", ) p.add_argument("paths", nargs="*", help="Subrepo paths (for add / remove)") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) # --- force-index --- - p = sub.add_parser("force-index", help="Force-index files/dirs (bypasses .gitignore)") + p = sub.add_parser( + "force-index", help="Force-index files/dirs (bypasses .gitignore)" + ) p.add_argument("paths", nargs="+", help="Files or directories") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) p.add_argument("--verbose", "-v", action="store_true") p.add_argument("--yes", "-y", action="store_true", help="Skip confirmation") # --- hooks --- - p = sub.add_parser("hooks", help="Manage git hooks that refresh the graph after pull/merge/checkout/rebase") + p = sub.add_parser( + "hooks", + help="Manage git hooks that refresh the graph after pull/merge/checkout/rebase", + ) p.add_argument( "action", nargs="?", @@ -429,15 +549,19 @@ def error(self, message: str) -> None: # 
type: ignore[override] action="store_true", help="Allow install into a shared core.hooksPath (affects every repo)", ) - p.add_argument("--root", default=os.getcwd()) + _add_root(p) # --- ensurepath --- - p = sub.add_parser("ensurepath", help="Add the cgh command's directory to your PATH") - p.add_argument("--yes", "-y", action="store_true", help="Skip the confirmation prompt") + p = sub.add_parser( + "ensurepath", help="Add the cgh command's directory to your PATH" + ) + p.add_argument( + "--yes", "-y", action="store_true", help="Skip the confirmation prompt" + ) # --- _reindex_hook (internal: invoked by the git hooks) --- p = sub.add_parser("_reindex_hook") - p.add_argument("--root", default=os.getcwd()) + _add_root(p) args = ap.parse_args() @@ -445,6 +569,25 @@ def error(self, message: str) -> None: # type: ignore[override] _print_help() return + # Resolve the codegraph root by walking up to the nearest .codegraph/, the + # way git finds its repo root via .git. This lets every command work from + # a subdirectory of an initialized repo. init/setup create in the literal + # directory, and _serve_owner / _reindex_hook get an explicit root from + # their spawner, so those opt out. The hint goes to stderr to keep stdout + # clean for --json output and piping. 
+ _NO_ROOT_WALK = {"init", "setup", "_serve_owner", "_reindex_hook"} + if args.cmd not in _NO_ROOT_WALK and getattr(args, "root", None): + from codegraph.core.config import find_codegraph_root + + discovered = find_codegraph_root(args.root) + if discovered is not None and discovered != Path(args.root).resolve(): + from rich.console import Console as _Console + + _Console(stderr=True).print( + f"[dim]Using codegraph root: {discovered}[/dim]" + ) + args.root = str(discovered) + dispatch = { "init": cmd_init, "setup": cmd_setup, @@ -471,6 +614,7 @@ def error(self, message: str) -> None: # type: ignore[override] "outline": cmd_outline, "doctor": cmd_doctor, "diff": cmd_diff, + "impact": cmd_impact, "history": cmd_history, "compact": cmd_compact, "graph": cmd_graph, diff --git a/codegraph/analysis/churn.py b/codegraph/analysis/churn.py new file mode 100644 index 0000000..01711db --- /dev/null +++ b/codegraph/analysis/churn.py @@ -0,0 +1,249 @@ +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# __creation__ = 2026-06-07 +# __author__ = "jndjama (Joy Ndjama)" +# __copyright__ = "Copyright 2026 ALTIKVA." +# __licence__ = "MIT & CC BY-NC-SA (http://www.altikva.com/licenses/LICENSE-1.0)" +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# Description: Git-history churn analysis. Pure functions over `git log`, no +# MCP and no graph DB. file_churn aggregates per-file commit +# counts, recency, authors, and line deltas from the numstat +# log. file_ownership rolls up the top authors for one file. +# Results are cached per (repo_root, git HEAD) so a long-running +# owner process does not re-shell out on every query. Every +# git call has a timeout and degrades to an empty result when +# git is absent or the command fails. + +from __future__ import annotations + +import subprocess +from pathlib import Path + +# How many commits back we walk by default. 
git log is slow on large repos, +# so we bound the history. Callers see this cap in the tool `note`. +DEFAULT_COMMIT_CAP = 2000 + +# Per-file ownership log is cheaper (one path), but still bounded. +OWNERSHIP_COMMIT_CAP = 500 + +_GIT_TIMEOUT = 30 + +# Cache keyed by (resolved_repo_root, head_sha, since). Keeps a long-lived +# owner process from re-running git log on every hotspots call. head_sha is +# part of the key so a new commit invalidates the entry naturally. +_CHURN_CACHE: dict[tuple[str, str | None, str | None], dict[str, dict]] = {} +_OWNERSHIP_CACHE: dict[tuple[str, str | None, str], list[dict]] = {} + + +def _git(repo_root: str | Path, *args: str) -> str | None: + """Run `git ` in repo_root. Return raw stdout or None on failure. + Mirrors the subprocess style in state/scan_meta.py.""" + try: + r = subprocess.run( + ["git", *args], + capture_output=True, + text=True, + encoding="utf-8", + errors="replace", + cwd=str(repo_root), + timeout=_GIT_TIMEOUT, + ) + if r.returncode == 0: + return r.stdout + except (subprocess.TimeoutExpired, FileNotFoundError, OSError): + pass + return None + + +def _head_sha(repo_root: str | Path) -> str | None: + out = _git(repo_root, "rev-parse", "HEAD") + return out.strip() if out else None + + +def file_churn( + repo_root: str | Path, + since: str | None = None, + commit_cap: int = DEFAULT_COMMIT_CAP, +) -> dict[str, dict]: + """Aggregate per-file change history from the git log. + + Walks `git log --numstat` over at most `commit_cap` commits (or a + `since` window, e.g. "3 months ago" or a ref), and returns a mapping of + repo-relative file path to: + + commits number of commits that touched the file + last_modified max author-time seen (unix seconds, int) + authors {author_name: commit_count} that touched the file + lines_added total added lines (binary diffs skipped) + lines_deleted total deleted lines + + Returns {} when the path is not a git repo, git is unavailable, or the + command fails. 
Results are cached per (repo_root, HEAD, since). + """ + root = Path(repo_root).resolve() + head = _head_sha(root) + key = (str(root), head, since) + cached = _CHURN_CACHE.get(key) + if cached is not None: + return cached + + args = ["log", "--no-merges", "--numstat", "--format=commit\x1f%H\x1f%an\x1f%at"] + if since: + args.append(f"--since={since}") + else: + args.append(f"--max-count={int(commit_cap)}") + + out = _git(root, *args) + if out is None: + # Do not cache a transient failure: a later call might succeed. + return {} + + result: dict[str, dict] = {} + cur_author = "" + cur_time = 0 + for line in out.splitlines(): + if not line: + continue + if line.startswith("commit\x1f"): + parts = line.split("\x1f") + # parts = ["commit", sha, author, author_time] + cur_author = parts[2] if len(parts) > 2 else "" + try: + cur_time = int(parts[3]) if len(parts) > 3 else 0 + except ValueError: + cur_time = 0 + continue + # numstat line: "\t\t". Binary files show "-". + cols = line.split("\t") + if len(cols) != 3: + continue + added_s, deleted_s, path = cols + path = _strip_rename(path) + if not path: + continue + entry = result.get(path) + if entry is None: + entry = { + "commits": 0, + "last_modified": 0, + "authors": {}, + "lines_added": 0, + "lines_deleted": 0, + } + result[path] = entry + entry["commits"] += 1 + if cur_time > entry["last_modified"]: + entry["last_modified"] = cur_time + if cur_author: + entry["authors"][cur_author] = entry["authors"].get(cur_author, 0) + 1 + if added_s != "-": + try: + entry["lines_added"] += int(added_s) + except ValueError: + pass + if deleted_s != "-": + try: + entry["lines_deleted"] += int(deleted_s) + except ValueError: + pass + + _CHURN_CACHE[key] = result + return result + + +def _strip_rename(path: str) -> str: + """Normalise a numstat path. 
Git emits renames as either + "old => new" or "dir/{old => new}/file"; we keep the new (current) path + so the churn keys match the working tree.""" + if "=>" not in path: + return path.strip() + # Brace form: prefix{old => new}suffix + if "{" in path and "}" in path: + pre, rest = path.split("{", 1) + mid, post = rest.split("}", 1) + new = mid.split("=>", 1)[1].strip() + combined = f"{pre}{new}{post}".replace("//", "/") + return combined.strip() + # Plain form: old => new + return path.split("=>", 1)[1].strip() + + +def file_ownership( + repo_root: str | Path, + file_path: str | Path, + commit_cap: int = OWNERSHIP_COMMIT_CAP, +) -> list[dict]: + """Top authors for a single file, by commit count then recency. + + Uses `git log --format=%an|%at -- ` (cheaper than blame and good + enough for an ownership signal). `file_path` may be absolute or + repo-relative; git resolves it against the repo. Returns a list of + {name, commits, last_commit} sorted by commit count descending, then by + most recent commit. Returns [] on any failure or when git is absent. + Cached per (repo_root, HEAD, file). + """ + root = Path(repo_root).resolve() + head = _head_sha(root) + # Normalise to a repo-relative POSIX path when the file lives under the + # repo, so absolute and relative callers share a cache entry. 
+ rel = _to_repo_relative(root, file_path) + key = (str(root), head, rel) + cached = _OWNERSHIP_CACHE.get(key) + if cached is not None: + return cached + + out = _git( + root, + "log", + f"--max-count={int(commit_cap)}", + "--format=%an\x1f%at", + "--", + rel, + ) + if out is None: + return [] + + tally: dict[str, dict] = {} + for line in out.splitlines(): + if not line: + continue + parts = line.split("\x1f") + name = parts[0] if parts else "" + if not name: + continue + try: + ts = int(parts[1]) if len(parts) > 1 else 0 + except ValueError: + ts = 0 + rec = tally.get(name) + if rec is None: + tally[name] = {"name": name, "commits": 1, "last_commit": ts} + else: + rec["commits"] += 1 + if ts > rec["last_commit"]: + rec["last_commit"] = ts + + ranked = sorted( + tally.values(), + key=lambda r: (r["commits"], r["last_commit"]), + reverse=True, + ) + _OWNERSHIP_CACHE[key] = ranked + return ranked + + +def _to_repo_relative(root: Path, file_path: str | Path) -> str: + """Return a POSIX repo-relative path when file_path is under root, + otherwise return the path as given (git handles the rest).""" + p = Path(file_path) + if p.is_absolute(): + try: + return p.resolve().relative_to(root).as_posix() + except (ValueError, OSError): + return str(file_path) + return Path(file_path).as_posix() + + +def clear_cache() -> None: + """Drop the per-HEAD churn / ownership caches. Mostly for tests.""" + _CHURN_CACHE.clear() + _OWNERSHIP_CACHE.clear() diff --git a/codegraph/analysis/endpoints.py b/codegraph/analysis/endpoints.py index 9d0afe6..0e617bd 100644 --- a/codegraph/analysis/endpoints.py +++ b/codegraph/analysis/endpoints.py @@ -5,9 +5,10 @@ # __licence__ = "MIT & CC BY-NC-SA (http://www.altikva.com/licenses/LICENSE-1.0)" # -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # Description: Extract HTTP endpoint definitions from source files. 
-# Supports FastAPI / Flask / Starlette decorators (Python), -# Nuxt server/api file-based routes, and Express/Fastify .route() -# calls (JS/TS). +# Supports FastAPI / Flask / Starlette decorators and Django +# urlpatterns (Python), Nuxt server/api file-based routes plus +# Express/Fastify and NestJS decorators (JS/TS), Spring +# @*Mapping decorators (Java), and Gin/Echo router calls (Go). from __future__ import annotations @@ -100,11 +101,73 @@ def _build_py_endpoint( ) +# --------------------------------------------------------------------------- +# Django, path() / re_path() entries in urls.py urlpatterns +# --------------------------------------------------------------------------- + +# path("donations//", views.detail, name="detail") +# re_path(r"^donations/$", DonationList.as_view()) +_DJANGO_ROUTE = re.compile( + r"""\b(?Ppath|re_path)\s*\( + \s*(?:r)?['"](?P[^'"]*)['"] # the route pattern + \s*,\s*(?P[^,)\n]+) # the view reference + """, + re.VERBOSE, +) + + +def extract_django(path: str | Path, src: str) -> list[EndpointDef]: + """Django URL routing, path() / re_path() in a urls.py urlpatterns list. + + Django does not pin a method at the route, so we emit method ANY. The + handler is the view callable (function or `View.as_view()` class name). 
+ """ + p = Path(path) + if p.name != "urls.py": + return [] + + out: list[EndpointDef] = [] + for i, line in enumerate(src.splitlines(), start=1): + m = _DJANGO_ROUTE.search(line) + if not m: + continue + url = m.group("path") + # Normalise the leading slash so paths read the same as other frameworks + norm = url if url.startswith("/") else "/" + url + # Strip a Django regex anchor so /^donations/$ reads as /donations/ + norm = norm.lstrip("/^").rstrip("$") + norm = "/" + norm + + view = m.group("view").strip() + handler = view + as_view = re.search(r"([A-Za-z_]\w*)\s*\.as_view\s*\(", view) + if as_view: + handler = as_view.group(1) + else: + # views.detail -> detail, app.views.detail -> detail + handler = view.split(".")[-1] + + out.append( + EndpointDef( + id=f"{path}::ANY::{norm}", + method="ANY", + path=norm, + framework="django", + file_path=str(path), + start_line=i, + handler_name=handler or None, + ) + ) + return out + + # --------------------------------------------------------------------------- # Nuxt, file-based routes under server/api/ # --------------------------------------------------------------------------- -_NUXT_METHOD_SUFFIX = re.compile(r"\.(get|post|put|patch|delete|head|options)\.(ts|js|mjs)$", re.IGNORECASE) +_NUXT_METHOD_SUFFIX = re.compile( + r"\.(get|post|put|patch|delete|head|options)\.(ts|js|mjs)$", re.IGNORECASE +) def extract_nuxt(path: str | Path, src: str) -> list[EndpointDef]: @@ -189,6 +252,166 @@ def extract_express(path: str | Path, src: str) -> list[EndpointDef]: return out +# --------------------------------------------------------------------------- +# NestJS, @Get('x') / @Post('x') controller method decorators (TS) +# --------------------------------------------------------------------------- + +# @Get(), @Get('profile'), @Post("login"), @Delete(':id') +_NEST_DECORATOR = re.compile( + r"""@(?P<method>Get|Post|Put|Patch|Delete|Head|Options|All) + \s*\(\s* + (?:['"`](?P<path>[^'"`]*)['"`])?
# optional path argument + \s*\) + """, + re.VERBOSE, +) + + +def extract_nest(path: str | Path, src: str) -> list[EndpointDef]: + """NestJS controller route decorators. Path defaults to "/" when the + decorator is called with no argument (`@Get()`). The handler is the + method name on the line that follows the decorator.""" + out: list[EndpointDef] = [] + lines = src.splitlines() + for i, line in enumerate(lines, start=1): + m = _NEST_DECORATOR.search(line) + if not m: + continue + method = m.group("method") + if method == "All": + method = "ANY" + sub = m.group("path") or "" + route = "/" + sub.strip("/") if sub else "/" + + handler = None + for j in range(i, min(i + 5, len(lines))): + ln = lines[j].lstrip() + if ln.startswith("@"): + continue + hm = re.match( + r"(?:public\s+|private\s+|protected\s+|async\s+)*([A-Za-z_]\w*)\s*\(", + ln, + ) + if hm: + handler = hm.group(1) + break + + out.append( + EndpointDef( + id=f"{path}::{method.upper()}::{route}", + method=method.upper(), + path=route, + framework="nestjs", + file_path=str(path), + start_line=i, + handler_name=handler, + ) + ) + return out + + +# --------------------------------------------------------------------------- +# Spring, @GetMapping / @RequestMapping(method = RequestMethod.POST) (Java) +# --------------------------------------------------------------------------- + +# @GetMapping("/users"), @PostMapping(value = "/users"), @RequestMapping("/x") +_SPRING_MAPPING = re.compile( + r"""@(?P<ann>Get|Post|Put|Patch|Delete|Request)Mapping + \s*\( + (?P<args>[^)]*) + \) + """, + re.VERBOSE, +) +_SPRING_PATH = re.compile(r"""(?:value\s*=\s*|path\s*=\s*)?['"]([^'"]+)['"]""") +_SPRING_METHOD = re.compile(r"RequestMethod\.(GET|POST|PUT|PATCH|DELETE|HEAD|OPTIONS)") + + +def extract_spring(path: str | Path, src: str) -> list[EndpointDef]: + """Spring MVC mapping annotations.
@RequestMapping infers the method from + a `method = RequestMethod.X` argument, defaulting to ANY when absent.""" + out: list[EndpointDef] = [] + lines = src.splitlines() + for i, line in enumerate(lines, start=1): + m = _SPRING_MAPPING.search(line) + if not m: + continue + ann = m.group("ann") + args = m.group("args") + + pm = _SPRING_PATH.search(args) + route = pm.group(1) if pm else "/" + + if ann == "Request": + mm = _SPRING_METHOD.search(args) + method = mm.group(1) if mm else "ANY" + else: + method = ann.upper() + + handler = None + for j in range(i, min(i + 5, len(lines))): + ln = lines[j].lstrip() + if ln.startswith("@"): + continue + hm = re.search(r"\b([A-Za-z_]\w*)\s*\(", ln) + if hm: + handler = hm.group(1) + break + + out.append( + EndpointDef( + id=f"{path}::{method}::{route}", + method=method, + path=route, + framework="spring", + file_path=str(path), + start_line=i, + handler_name=handler, + ) + ) + return out + + +# --------------------------------------------------------------------------- +# Gin / Echo, r.GET("/path", handler) (Go) +# --------------------------------------------------------------------------- + +# r.GET("/users", listUsers) / e.POST("/users", h.Create) / group.DELETE(...) +_GO_ROUTE = re.compile( + r"""\b(?P[A-Za-z_]\w*) + \.(?PGET|POST|PUT|PATCH|DELETE|HEAD|OPTIONS|Any) + \s*\(\s*['"](?P[^'"]+)['"] + \s*,\s*(?P[A-Za-z_][\w.]*) + """, + re.VERBOSE, +) + + +def extract_go(path: str | Path, src: str) -> list[EndpointDef]: + """Gin and Echo router calls. Both expose `.METHOD(path, handler)`, + so a single pattern covers them. 
The handler is the last dotted segment.""" + out: list[EndpointDef] = [] + for i, line in enumerate(src.splitlines(), start=1): + m = _GO_ROUTE.search(line) + if not m: + continue + method = m.group("method") + method = "ANY" if method == "Any" else method.upper() + handler = m.group("handler").split(".")[-1] + out.append( + EndpointDef( + id=f"{path}::{method}::{m.group('path')}", + method=method, + path=m.group("path"), + framework="gin", + file_path=str(path), + start_line=i, + handler_name=handler or None, + ) + ) + return out + + # --------------------------------------------------------------------------- # Dispatcher # --------------------------------------------------------------------------- @@ -198,11 +421,21 @@ def extract(path: str | Path, src: str) -> list[EndpointDef]: p = Path(path) suffix = p.suffix.lower() if suffix == ".py": - return extract_python(p, src) + eps = extract_python(p, src) + eps.extend(extract_django(p, src)) + return eps if suffix in (".ts", ".tsx", ".js", ".mjs"): - # Nuxt first (path-based), then a best-effort express scan + # Nuxt first (path-based), then NestJS decorators, then a best-effort + # express scan. NestJS and Express rarely co-occur in one file. 
nuxt = extract_nuxt(p, src) if nuxt: return nuxt + nest = extract_nest(p, src) + if nest: + return nest return extract_express(p, src) + if suffix == ".java": + return extract_spring(p, src) + if suffix == ".go": + return extract_go(p, src) return [] diff --git a/codegraph/analysis/federation.py b/codegraph/analysis/federation.py index c71c278..496a093 100644 --- a/codegraph/analysis/federation.py +++ b/codegraph/analysis/federation.py @@ -12,6 +12,7 @@ from __future__ import annotations +import os import sqlite3 from collections.abc import Callable, Iterator from contextlib import contextmanager @@ -108,16 +109,33 @@ def child_paths_to_skip(repo_root: str | Path) -> list[Path]: def is_under_any(path: str | Path, roots: list[Path]) -> bool: - """Return True if `path` lives under any of `roots` (inclusive).""" + """Return True if `path` lives under any of `roots` (inclusive). + + Both sides are resolved and case-normalized before comparison. This matters + on Windows: the filesystem is case-insensitive and `resolve()` may change + casing or 8.3 short-names, while a candidate built as `root / "a/b.tf"` from + a git ls-files path mixes separators. The previous version only resolved the + roots (via resolve_children) and left an already-absolute candidate + untouched, so on Windows the case/short-name mismatch made every federated + subrepo fail to match, and none of their files were skipped from the parent + scan. pathlib's relative_to is case-sensitive, so we compare normcase'd + strings on a path-separator boundary instead. 
+ """ if not roots: return False - p = Path(path).resolve() if not Path(path).is_absolute() else Path(path) - for root in roots: + + def _norm(pp: Path) -> str: try: - p.relative_to(root) + pp = pp.resolve() + except OSError: + pp = pp.absolute() + return os.path.normcase(str(pp)) + + p_norm = _norm(Path(path)) + for root in roots: + r_norm = _norm(Path(root)) + if p_norm == r_norm or p_norm.startswith(r_norm + os.sep): return True - except ValueError: - continue return False @@ -252,7 +270,9 @@ def open_fts_ro(repo_root: Path) -> Iterator[sqlite3.Connection | None]: # written to by their own owners while we read. from codegraph.core.utils import ro_sqlite_uri - conn = sqlite3.connect(ro_sqlite_uri(db_path), uri=True, check_same_thread=False) + conn = sqlite3.connect( + ro_sqlite_uri(db_path), uri=True, check_same_thread=False + ) yield conn except sqlite3.Error: yield None @@ -355,6 +375,52 @@ def _run_one_graphdb( ) +def federate_scoped( + get_parent_conn: Callable[[], GraphDB], + repo_root: str | Path | None, + query_fn: Callable[[GraphDB], Any], +) -> tuple[list[tuple[str, list]], list[dict]]: + """Run ``query_fn(conn)`` against the parent's in-process write conn and + each federated child's RO graph DB. + + Returns ``(scoped, warnings)`` where ``scoped`` is ``[(scope, payload), …]`` + (scope "parent" first, then each child by name) and ``warnings`` is + ``[{scope, error}, …]`` for any scope whose query failed. + + This is the one place the parent+children fan-out lives; MCP tool modules + call it instead of each re-implementing the loop. 
+ """ + scoped: list[tuple[str, list]] = [] + warnings: list[dict] = [] + try: + scoped.append(("parent", query_fn(get_parent_conn()) or [])) + except Exception as exc: + warnings.append({"scope": "parent", "error": f"{type(exc).__name__}: {exc}"}) + if repo_root is not None: + for s in for_each_child_graphdb(repo_root, lambda c, _r: query_fn(c)): + if s.error: + warnings.append({"scope": s.scope, "error": s.error}) + continue + scoped.append((s.scope, s.payload or [])) + return scoped, warnings + + +def federate_flat( + get_parent_conn: Callable[[], GraphDB], + repo_root: str | Path | None, + query_fn: Callable[[GraphDB], Any], +) -> tuple[list[dict], list[dict]]: + """Flattened view of :func:`federate_scoped`: every payload row gets a + ``"scope"`` key and all rows land in one list. Returns ``(rows, warnings)``.""" + scoped, warnings = federate_scoped(get_parent_conn, repo_root, query_fn) + rows: list[dict] = [] + for scope, payload in scoped: + for item in payload: + item["scope"] = scope + rows.append(item) + return rows, warnings + + # Backward-compat aliases for callers that import the old names. New # code should prefer the _graphdb variants, these will be removed in # the 0.6 release that also deletes the Kuzu-specific code paths. 
@@ -367,7 +433,7 @@ def for_each_fts( repo_root: str | Path, fn: Callable[[sqlite3.Connection, Path], Any], ) -> list[ScopedResult]: - """Same as for_each_kuzu but for the FTS sqlite databases.""" + """Same as for_each_graphdb but for the FTS sqlite databases.""" parent = Path(repo_root).resolve() return [_run_one_fts(root, parent, fn) for root in iter_db_roots(parent)] @@ -438,7 +504,9 @@ def child_owner_status(child_path: str | Path) -> OwnerStatus: # --------------------------------------------------------------------------- -def add_subrepo(repo_root: str | Path, child_path: str | Path) -> tuple[Path, ChildStatus]: +def add_subrepo( + repo_root: str | Path, child_path: str | Path +) -> tuple[Path, ChildStatus]: """ Append `child_path` to the parent's config.toml and return its status. Idempotent, if already present, just returns the current status. diff --git a/codegraph/analysis/impact.py b/codegraph/analysis/impact.py new file mode 100644 index 0000000..463f1ea --- /dev/null +++ b/codegraph/analysis/impact.py @@ -0,0 +1,295 @@ +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# __creation__ = 2026-06-07 +# __author__ = "jndjama (Joy Ndjama)" +# __copyright__ = "Copyright 2026 ALTIKVA." +# __licence__ = "MIT & CC BY-NC-SA (http://www.altikva.com/licenses/LICENSE-1.0)" +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# Description: Pure GraphDB-protocol helpers shared by the test-mapping MCP +# tools (tests_for / untested) and the `cgh impact` CI command. +# Computes test-to-code mapping on the fly from IMPORTS + CALLS +# edges plus File.role, with no new edge type, plus a bounded +# reverse-BFS over IMPORTS for blast radius. Backend-neutral: +# every call goes through the GraphDB protocol, no raw SQL. + +from __future__ import annotations + +from typing import Any + +# Hard caps so a pathological graph never produces an unbounded result. 
+TEST_ROLE = "test" +_FANOUT_CAP = 500 +_REVERSE_CAP = 300 + + +def _is_test_role(role: str | None) -> bool: + """A File node is a test when roles.classify tagged it `test`.""" + return (role or "") == TEST_ROLE + + +def file_role(conn: Any, file_path: str) -> tuple[str, str]: + """Return (role, layer) for a File node, or ("", "") when absent.""" + rows = conn.find_nodes( + "File", + where={"path": file_path}, + return_fields=["role", "layer"], + limit=1, + ) + if not rows: + return "", "" + return rows[0].get("role") or "", rows[0].get("layer") or "" + + +def resolve_target_file(conn: Any, target: str) -> str | None: + """Resolve a symbol-or-file argument to a defining File path. + + - If ``target`` is itself a File node path, return it. + - Else treat it as a Function / Class name and return the path of the + first defining file found. + + Returns None when nothing matches. + """ + hits = conn.find_nodes("File", where={"path": target}, limit=1) + if hits: + return target + for label in ("Function", "Class"): + rows = conn.find_nodes( + label, + where={"name": target}, + return_fields=["file_path"], + limit=1, + ) + if rows and rows[0].get("file_path"): + return rows[0]["file_path"] + return None + + +def tests_for_file(conn: Any, target_file: str) -> list[dict[str, str]]: + """Test files that import ``target_file`` directly. + + Inferred heuristic: a test file is a File node whose role is `test`, and + that has an IMPORTS edge into the target file. Returns + ``[{"file", "role"}]`` (de-duplicated, order-preserving). 
+ """ + seen: set[str] = set() + out: list[dict[str, str]] = [] + for row in conn.find_neighbors( + "IMPORTS", + dst_key=target_file, + return_src=["path", "role"], + limit=_FANOUT_CAP, + ): + path = row.get("src_path") + role = row.get("src_role") or "" + if not path or path in seen: + continue + if not _is_test_role(role): + continue + seen.add(path) + out.append({"file": path, "role": role}) + return out + + +def tests_calling_symbol(conn: Any, symbol: str) -> list[dict[str, str]]: + """Test files that hold a function whose CALLS reach ``symbol``. + + Inferred heuristic: find Function nodes named ``symbol``, walk CALLS + backward one hop, and keep callers that live in a `test`-role file. + CALLS edges are same-file-scoped (per BUG-2) and name-matched, so this is + a candidate set, not ground truth. Returns ``[{"file", "role"}]``. + """ + target_ids = [ + r["id"] + for r in conn.find_nodes( + "Function", where={"name": symbol}, return_fields=["id"] + ) + ] + seen: set[str] = set() + out: list[dict[str, str]] = [] + for tid in target_ids: + for row in conn.find_neighbors( + "CALLS", + dst_key=tid, + return_src=["id"], + limit=_FANOUT_CAP, + ): + caller_id = row.get("src_id") or "" + caller_file = caller_id.rsplit("::", 1)[0] if "::" in caller_id else "" + if not caller_file or caller_file in seen: + continue + role, _ = file_role(conn, caller_file) + if not _is_test_role(role): + continue + seen.add(caller_file) + out.append({"file": caller_file, "role": role}) + return out + + +def tests_for(conn: Any, target: str) -> dict[str, Any]: + """Full test-to-code mapping for one scope. + + Resolves ``target`` to a defining file, collects importing test files, + and (when ``target`` is a symbol) test files whose calls reach it. + Returns ``{target, target_file, tests: [{file, role}], count}``. When the + target cannot be resolved, ``target_file`` is None and ``tests`` is empty. 
+ """ + target_file = resolve_target_file(conn, target) + if target_file is None: + return {"target": target, "target_file": None, "tests": [], "count": 0} + + seen: set[str] = set() + tests: list[dict[str, str]] = [] + for t in tests_for_file(conn, target_file): + if t["file"] in seen: + continue + seen.add(t["file"]) + tests.append(t) + + # When the argument named a symbol (not the file itself), also follow + # the call graph from that symbol into test functions. + if target != target_file: + for t in tests_calling_symbol(conn, target): + if t["file"] in seen: + continue + seen.add(t["file"]) + tests.append(t) + + return { + "target": target, + "target_file": target_file, + "tests": tests, + "count": len(tests), + } + + +def untested_files( + conn: Any, + role: str = "", + layer: str = "", + cap: int = 200, +) -> tuple[list[dict[str, str]], bool]: + """Non-test source files that no test file imports. + + Walks every File node, skips test / doc files, applies the optional + role / layer filter, and keeps those with no `test`-role importer. + Returns ``(rows, truncated)`` where each row is ``{file, role, layer}``. + """ + where: dict[str, Any] = {} + if role: + where["role"] = role + if layer: + where["layer"] = layer + + files = conn.find_nodes( + "File", + where=where or None, + return_fields=["path", "role", "layer"], + order_by=["path"], + ) + + out: list[dict[str, str]] = [] + truncated = False + for f in files: + path = f.get("path") + frole = f.get("role") or "" + flayer = f.get("layer") or "" + if not path: + continue + # Never report test or doc files as "untested". 
+ if frole in (TEST_ROLE, "doc"): + continue + if tests_for_file(conn, path): + continue + out.append({"file": path, "role": frole, "layer": flayer}) + if len(out) >= cap: + truncated = True + break + return out, truncated + + +def reverse_import_bfs( + conn: Any, + start_files: list[str], + max_depth: int = 3, +) -> tuple[list[str], bool]: + """Bounded reverse BFS over IMPORTS: every file that transitively imports + any of ``start_files`` within ``max_depth`` hops. + + Returns ``(ordered_file_paths, truncated)``. ``start_files`` themselves are + not included in the result. Caps both the per-node fan-out and the total + result size so a hub file cannot blow up the walk. + """ + seen: set[str] = set(start_files) + frontier = list(start_files) + ordered: list[str] = [] + truncated = False + depth = 0 + while frontier and depth < max(1, int(max_depth)): + depth += 1 + nxt: list[str] = [] + for key in frontier: + rows = conn.find_neighbors( + "IMPORTS", + dst_key=key, + return_src=["path"], + limit=_FANOUT_CAP, + ) + if len(rows) >= _FANOUT_CAP: + truncated = True + for r in rows: + src = r.get("src_path") + if not src or src in seen: + continue + seen.add(src) + ordered.append(src) + nxt.append(src) + if len(ordered) >= _REVERSE_CAP: + return ordered, True + frontier = nxt + return ordered, truncated + + +def symbols_in_file(conn: Any, file_path: str) -> list[dict[str, str]]: + """Functions and classes defined in ``file_path``. + + Returns ``[{name, kind, lines}]`` ordered by start line. Used by the + impact command to report which symbols actually changed in a diff. 
+ """ + out: list[dict[str, str]] = [] + for label, kind in (("Function", "function"), ("Class", "class")): + for s in conn.find_nodes( + label, + where={"file_path": file_path}, + return_fields=["name", "start_line", "end_line"], + order_by=["start_line"], + ): + out.append( + { + "name": s.get("name", ""), + "kind": kind, + "lines": f"{s.get('start_line', '')}-{s.get('end_line', '')}", + } + ) + return out + + +def endpoints_in_files(conn: Any, files: list[str]) -> list[dict[str, str]]: + """Endpoints declared (DEFINES_ENDPOINT) in any of ``files``. + + Returns ``[{file, method, path}]``, de-duplicated. + """ + seen: set[tuple[str, str, str]] = set() + out: list[dict[str, str]] = [] + for fp in files: + for e in conn.find_neighbors( + "DEFINES_ENDPOINT", + src_key=fp, + return_dst=["method", "path"], + ): + method = e.get("dst_method", "") or "" + path = e.get("dst_path", "") or "" + key = (fp, method, path) + if key in seen: + continue + seen.add(key) + out.append({"file": fp, "method": method, "path": path}) + return out diff --git a/codegraph/analysis/pattern.py b/codegraph/analysis/pattern.py index 9cf2c9b..0b66de2 100644 --- a/codegraph/analysis/pattern.py +++ b/codegraph/analysis/pattern.py @@ -116,8 +116,9 @@ def _run_rg( args.append("--fixed-strings") if glob: args.extend(["--glob", glob]) - args.append(pattern) - args.append(str(root)) + # "--" stops the pattern from being parsed as a flag: without it a + # pattern like "--pre=sh" reaches ripgrep's preprocessor (code exec). + args.extend(["--", pattern, str(root)]) try: r = subprocess.run(args, capture_output=True, text=True, encoding="utf-8", errors="replace", timeout=30) except (subprocess.TimeoutExpired, OSError): @@ -169,7 +170,8 @@ def _run_git_grep( args.append("-F") else: args.append("-E") - args.append(pattern) + # "-e " so a pattern starting with "-" is never read as a flag. 
+ args.extend(["-e", pattern]) if glob: args.extend(["--", glob]) try: diff --git a/codegraph/analysis/precise_calls.py b/codegraph/analysis/precise_calls.py new file mode 100644 index 0000000..b939d87 --- /dev/null +++ b/codegraph/analysis/precise_calls.py @@ -0,0 +1,243 @@ +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# __creation__ = 2026-06-07 +# __author__ = "jndjama (Joy Ndjama)" +# __copyright__ = "Copyright 2026 ALTIKVA." +# __licence__ = "MIT & CC BY-NC-SA (http://www.altikva.com/licenses/LICENSE-1.0)" +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# Description: Opt-in precise CALLS resolver for Python (proof of concept). +# Uses jedi (the static engine python-lsp-server wraps) to do +# goto-definition on every call site in a file, then maps each +# resolved definition to codegraph's Function id scheme +# ("{file}::{name}" or "{file}::{Class}.{method}"). Only targets +# that resolve to a file INSIDE repo_root are kept; stdlib and +# site-packages are dropped. jedi is imported lazily and every +# failure degrades to an empty result, so the core install never +# depends on it. The public shape (resolve_calls_for_file) is the +# seam a real LSP backend could replace later. + +from __future__ import annotations + +import sys +from pathlib import Path + + +def _import_jedi(): + """Import jedi + parso lazily, then restore the recursion limit. + + jedi calls sys.setrecursionlimit(3000) on import, which would silently + undo the indexer's raised limit (it bumps to 10_000 so tree-sitter walks + on deeply nested ASTs don't crash). Re-assert the higher value after the + import so enabling this opt-in feature never weakens the core guard. + Returns (jedi, parso) or None if either is missing.
+ """ + prior = sys.getrecursionlimit() + try: + import jedi + import parso + except Exception: + return None + if sys.getrecursionlimit() < prior: + sys.setrecursionlimit(prior) + return jedi, parso + + +# Hard cap on call sites resolved per file. goto() is the expensive step; +# a pathological generated file should not stall a scan. Past the cap we +# stop and return what we have. +_MAX_CALL_SITES = 2000 + + +def jedi_available() -> bool: + """True when jedi can be imported. Cheap, swallows any import error.""" + return _import_jedi() is not None + + +def _enclosing_caller_id(leaf, file_path: str) -> str | None: + """Walk up the parso tree from a call leaf to the function that contains + it, and build that function's graph Function id. + + Mirrors the parser's qualname scheme: a method gets + "{file}::{Class}.{method}", a plain function "{file}::{name}". Only the + nearest enclosing class is used (the parser does not model deeper + nesting either). Returns None when the call sits at module level. + """ + funcdef = None + node = leaf.parent + while node is not None: + if getattr(node, "type", None) == "funcdef": + funcdef = node + break + node = node.parent + if funcdef is None: + return None + + fn_name = funcdef.name.value + + # Nearest enclosing class, if any, for the Class.method form. + class_name = None + node = funcdef.parent + while node is not None: + if getattr(node, "type", None) == "classdef": + class_name = node.name.value + break + if getattr(node, "type", None) == "funcdef": + # A function nested inside another function: stop, the parser + # would not attach a class context here. + break + node = node.parent + + if class_name: + return f"{file_path}::{class_name}.{fn_name}" + return f"{file_path}::{fn_name}" + + +def _iter_call_leaves(node): + """Yield the name leaf being called for every call expression in a parso + tree. For "obj.method()" this yields the "method" leaf; for "foo()" the + "foo" leaf. 
+ """ + node_type = getattr(node, "type", None) + if node_type in ("atom_expr", "power"): + children = node.children + for i, ch in enumerate(children): + if ( + getattr(ch, "type", None) == "trailer" + and ch.children + and getattr(ch.children[0], "type", None) == "operator" + and ch.children[0].value == "(" + ): + callee = children[i - 1] if i > 0 else children[0] + leaf = callee + while getattr(leaf, "children", None): + leaf = leaf.children[-1] + if getattr(leaf, "type", None) == "name": + yield leaf + for child in getattr(node, "children", []) or []: + yield from _iter_call_leaves(child) + + +def _target_id( + definition, repo_root: Path, repo_root_real: Path +) -> tuple[str, str] | None: + """Map a jedi Definition to a (target_file, target_id) pair, or None if it + does not resolve to a function/method defined in a file inside the repo. + + repo_root is the path the indexer was called with (used to rebuild the + target file path in the SAME form the indexer stored). repo_root_real is + its symlink-resolved form, used only for the containment check, since + jedi reports module_path with symlinks resolved (e.g. /tmp -> /private/tmp + on macOS). Rebuilding from repo_root keeps the id byte-identical to the + Function node the indexer wrote. + """ + mod_path = definition.module_path + if mod_path is None: + return None + try: + target_real = Path(mod_path).resolve() + except Exception: + return None + + # Keep only definitions inside the repo (drop stdlib / site-packages). + try: + rel = target_real.relative_to(repo_root_real) + except ValueError: + return None + + if definition.type not in ("function", "method"): + return None + + name = definition.name + if not name: + return None + + # Class context comes from the definition's parent. jedi reports a + # method's parent as the owning class. 
+ class_name = None + try: + parent = definition.parent() + except Exception: + parent = None + if parent is not None and getattr(parent, "type", None) == "class": + class_name = parent.name + + # Rebuild the file path from the indexer's (possibly unresolved) repo_root + # so the id matches the stored Function node exactly. + file_str = str(repo_root / rel) + if class_name: + return (file_str, f"{file_str}::{class_name}.{name}") + return (file_str, f"{file_str}::{name}") + + +def resolve_calls_for_file( + file_path: str | Path, repo_root: str | Path +) -> list[tuple[str, str, str]]: + """Resolve every Python call site in ``file_path`` to its definition. + + Returns a list of (caller_id, target_file, target_id) tuples, where + caller_id and target_id are graph Function ids and target_file is the + absolute path of the file that defines the callee. Only callees that + resolve INSIDE repo_root are returned. + + Never raises: if jedi is missing, the file is unreadable, or jedi errors + on a node, the offending item is skipped and resolution continues. A + total failure yields an empty list, which the indexer treats as "fall + back to the name-matched resolver". + """ + mods = _import_jedi() + if mods is None: + return [] + jedi, parso = mods + + file_path = Path(file_path) + repo_root = Path(repo_root) + repo_root_real = repo_root.resolve() + + try: + source = file_path.read_text(encoding="utf-8", errors="replace") + except OSError: + return [] + + try: + project = jedi.Project(str(repo_root_real)) + script = jedi.Script(code=source, path=str(file_path), project=project) + tree = parso.parse(source) + except Exception: + return [] + + out: list[tuple[str, str, str]] = [] + seen: set[tuple[str, str]] = set() + count = 0 + for leaf in _iter_call_leaves(tree): + if count >= _MAX_CALL_SITES: + break + count += 1 + + caller_id = _enclosing_caller_id(leaf, str(file_path)) + if caller_id is None: + # Call at module level: no Function node to anchor the edge. 
+ continue + + line, column = leaf.start_pos + try: + definitions = script.goto(line, column, follow_imports=True) + except Exception: + continue + + for definition in definitions: + try: + resolved = _target_id(definition, repo_root, repo_root_real) + except Exception: + resolved = None + if resolved is None: + continue + target_file, target_id = resolved + # Drop self-recursion noise the name matcher never emitted either. + if target_id == caller_id: + continue + key = (caller_id, target_id) + if key in seen: + continue + seen.add(key) + out.append((caller_id, target_file, target_id)) + + return out diff --git a/codegraph/cli/commands_ensurepath.py b/codegraph/cli/commands_ensurepath.py index 1e97a9d..d76ef61 100644 --- a/codegraph/cli/commands_ensurepath.py +++ b/codegraph/cli/commands_ensurepath.py @@ -11,6 +11,8 @@ from __future__ import annotations +import argparse + import sys from pathlib import Path @@ -18,11 +20,13 @@ from codegraph.state import ensurepath as ep -def cmd_ensurepath(args) -> None: +def cmd_ensurepath(args: argparse.Namespace) -> None: scripts = ep.scripts_dir() if ep.is_on_path(scripts): - console.print(f"[green]cgh is already on your PATH[/green] [dim]({scripts})[/dim]") + console.print( + f"[green]cgh is already on your PATH[/green] [dim]({scripts})[/dim]" + ) return env = ep.detect_env() @@ -47,9 +51,13 @@ def cmd_ensurepath(args) -> None: if not getattr(args, "yes", False) and sys.stdin.isatty(): try: - answer = console.input( - f"Add cgh to PATH by appending to [cyan]{profile}[/cyan]? [Y/n] " - ).strip().lower() + answer = ( + console.input( + f"Add cgh to PATH by appending to [cyan]{profile}[/cyan]? 
[Y/n] " + ) + .strip() + .lower() + ) except EOFError: answer = "n" if answer in ("n", "no"): diff --git a/codegraph/cli/commands_federate.py b/codegraph/cli/commands_federate.py index fae6d9f..623c0e1 100644 --- a/codegraph/cli/commands_federate.py +++ b/codegraph/cli/commands_federate.py @@ -12,6 +12,8 @@ from __future__ import annotations +import argparse + import os from pathlib import Path @@ -29,7 +31,7 @@ console = Console() -def cmd_federate(args) -> None: +def cmd_federate(args: argparse.Namespace) -> None: """Dispatcher for `cgh federate `.""" action = getattr(args, "action", None) or "list" if action == "add": @@ -45,10 +47,12 @@ def cmd_federate(args) -> None: if action == "down": return _cmd_down(args) console.print(f"[red]Unknown action: {action}[/red]") - console.print("[dim]Usage: cgh federate add | remove | list | verify | up | down[/dim]") + console.print( + "[dim]Usage: cgh federate add | remove | list | verify | up | down[/dim]" + ) -def _cmd_add(args) -> None: +def _cmd_add(args: argparse.Namespace) -> None: paths = getattr(args, "paths", None) or [] if not paths: console.print("[red]Usage: cgh federate add [ …][/red]") @@ -76,7 +80,7 @@ def _cmd_add(args) -> None: console.print(f"[green]✓ {child}[/green] federated.") -def _cmd_remove(args) -> None: +def _cmd_remove(args: argparse.Namespace) -> None: paths = getattr(args, "paths", None) or [] if not paths: console.print("[red]Usage: cgh federate remove [/red]") @@ -89,7 +93,7 @@ def _cmd_remove(args) -> None: console.print(f"[dim]not federated: {raw}[/dim]") -def _cmd_list(args) -> None: +def _cmd_list(args: argparse.Namespace) -> None: root = Path(os.path.abspath(args.root)) children = resolve_children(root) if not children: @@ -99,7 +103,7 @@ def _cmd_list(args) -> None: _render_status_table(root, children) -def _cmd_verify(args) -> None: +def _cmd_verify(args: argparse.Namespace) -> None: """Same as list, plus exits non-zero if any child is broken.""" root = Path(os.path.abspath(args.root)) 
children = resolve_children(root) @@ -114,7 +118,7 @@ def _cmd_verify(args) -> None: raise SystemExit(1) -def _cmd_up(args) -> None: +def _cmd_up(args: argparse.Namespace) -> None: """Ensure each federated child has its own owner running with --watch. For children whose owner is already alive: no-op. For children that are @@ -141,12 +145,16 @@ def _cmd_up(args) -> None: continue if is_owner_alive(child): owner = child_owner_status(child) - console.print(f"[dim]• {child.name} already running (pid {owner.pid}, port {owner.port})[/dim]") + console.print( + f"[dim]• {child.name} already running (pid {owner.pid}, port {owner.port})[/dim]" + ) continue register_keepalive(child) port = spawn_owner(child, watch=True, reindex=False) if port is None: - console.print(f"[red]✗ {child.name}[/red] failed to start (see {child}/.codegraph/owner.log)") + console.print( + f"[red]✗ {child.name}[/red] failed to start (see {child}/.codegraph/owner.log)" + ) else: console.print(f"[green]✓ {child.name}[/green] started on port {port}") @@ -156,7 +164,7 @@ def _cmd_up(args) -> None: ) -def _cmd_down(args) -> None: +def _cmd_down(args: argparse.Namespace) -> None: """Stop the owner of each federated child (if running) and remove its keepalive marker. 
Doesn't touch children whose owners were started by something other than `cgh federate up`, they'll just lose their diff --git a/codegraph/cli/commands_githooks.py b/codegraph/cli/commands_githooks.py index b9fb3cc..395f1b7 100644 --- a/codegraph/cli/commands_githooks.py +++ b/codegraph/cli/commands_githooks.py @@ -11,6 +11,8 @@ from __future__ import annotations +import argparse + import os from pathlib import Path @@ -24,7 +26,7 @@ ) -def cmd_githooks(args) -> None: +def cmd_githooks(args: argparse.Namespace) -> None: """Dispatcher for `cgh hooks `.""" action = getattr(args, "action", None) or "status" root = Path(os.path.abspath(args.root)) diff --git a/codegraph/cli/commands_graph.py b/codegraph/cli/commands_graph.py index 856d5c8..a68f5ba 100644 --- a/codegraph/cli/commands_graph.py +++ b/codegraph/cli/commands_graph.py @@ -8,6 +8,8 @@ from __future__ import annotations +import argparse + import os from pathlib import Path @@ -15,7 +17,7 @@ from codegraph.cli import console -SCOPES = ["imports", "calls", "classes", "docs", "overview"] +SCOPES = ["imports", "calls", "classes", "docs", "overview", "layers"] # --------------------------------------------------------------------------- @@ -23,7 +25,9 @@ # --------------------------------------------------------------------------- -def _fetch_mermaid_via_owner(root: str, scope: str, symbol: str, file: str, max_nodes: int) -> str | None: +def _fetch_mermaid_via_owner( + root: str, scope: str, symbol: str, file: str, max_nodes: int +) -> str | None: """ Ask the running MCP owner to build the Mermaid diagram for us. 
Works while the owner holds the Kuzu write lock (which blocks our @@ -50,6 +54,7 @@ def _fetch_mermaid_via_owner(root: str, scope: str, symbol: str, file: str, max_ "classes": "class_hierarchy", "docs": "doc_structure", "overview": "full_overview", + "layers": "layers", } args_payload = { "scope": scope_map.get(scope, scope), @@ -99,7 +104,7 @@ def _fetch_mermaid_via_owner(root: str, scope: str, symbol: str, file: str, max_ return None -def cmd_graph(args) -> None: +def cmd_graph(args: argparse.Namespace) -> None: """Generate and display a graph visualization.""" from codegraph.core.db import get_readonly_connection from codegraph.viz import ( @@ -108,6 +113,7 @@ def cmd_graph(args) -> None: mermaid_classes, mermaid_docs, mermaid_imports, + mermaid_layers, mermaid_overview, open_in_browser, ) @@ -121,7 +127,9 @@ def cmd_graph(args) -> None: # Try the owner's HTTP endpoint first, it works even when the # Kuzu lock is held (which blocks readonly connections from CLI). - mermaid_code: str | None = _fetch_mermaid_via_owner(root, scope, symbol, file, max_nodes) + mermaid_code: str | None = _fetch_mermaid_via_owner( + root, scope, symbol, file, max_nodes + ) if mermaid_code is None: # Owner not running, open Kuzu directly. 
@@ -140,6 +148,7 @@ def cmd_graph(args) -> None: "classes": lambda: mermaid_classes(conn, root, symbol, max_nodes), "docs": lambda: mermaid_docs(conn, root, file, max_nodes), "overview": lambda: mermaid_overview(conn, root, max_nodes), + "layers": lambda: mermaid_layers(conn, root, max_nodes), } mermaid_code = generators[scope]() @@ -159,7 +168,9 @@ def cmd_graph(args) -> None: meta += f" file={file}" html_content = generate_html(mermaid_code, scope, root, meta) out_path.write_text(html_content, encoding="utf-8") - console.print(f" [green]+[/green] {out_path} [dim]({len(html_content):,} bytes)[/dim]") + console.print( + f" [green]+[/green] {out_path} [dim]({len(html_content):,} bytes)[/dim]" + ) return # Default: generate HTML and open in browser @@ -187,7 +198,7 @@ def cmd_graph(args) -> None: # --------------------------------------------------------------------------- -def cmd_add_dir(args) -> None: +def cmd_add_dir(args: argparse.Namespace) -> None: """Add or manage extra directories in the graph.""" from codegraph.core.config import CODEGRAPH_DIR, CONFIG_FILE @@ -259,7 +270,9 @@ def cmd_add_dir(args) -> None: console.print(f" [red]-[/red] {d}") return - console.print("[dim]Usage: cgh add-dir add | cgh add-dir remove | cgh add-dir list[/dim]") + console.print( + "[dim]Usage: cgh add-dir add | cgh add-dir remove | cgh add-dir list[/dim]" + ) def _write_extra_dirs(config_path: Path, data: dict, extra_dirs: list[str]) -> None: @@ -301,16 +314,33 @@ def register_graph_parser(sub) -> None: # --- graph --- p = sub.add_parser("graph", help="Visualize the code graph (opens in browser)") - p.add_argument("scope", nargs="?", default="overview", choices=SCOPES, help="What to visualize (default: overview)") + p.add_argument( + "scope", + nargs="?", + default="overview", + choices=SCOPES, + help="What to visualize (default: overview)", + ) p.add_argument("--symbol", "-s", help="Filter to a symbol (for calls/classes)") p.add_argument("--file", "-f", help="Filter to a file 
(for imports/docs)") - p.add_argument("--max-nodes", "-n", type=int, default=40, help="Max nodes (default: 40)") - p.add_argument("--mermaid", action="store_true", help="Output raw Mermaid to stdout") - p.add_argument("--html", metavar="FILE", help="Write HTML to file instead of opening browser") + p.add_argument( + "--max-nodes", "-n", type=int, default=40, help="Max nodes (default: 40)" + ) + p.add_argument( + "--mermaid", action="store_true", help="Output raw Mermaid to stdout" + ) + p.add_argument( + "--html", metavar="FILE", help="Write HTML to file instead of opening browser" + ) p.add_argument("--root", default=os.getcwd()) # --- add-dir --- p = sub.add_parser("add-dir", help="Manage extra directories in the graph") - p.add_argument("action", nargs="?", choices=["add", "remove", "list"], help="Action (default: list)") + p.add_argument( + "action", + nargs="?", + choices=["add", "remove", "list"], + help="Action (default: list)", + ) p.add_argument("paths", nargs="*", help="Directory paths") p.add_argument("--root", default=os.getcwd()) diff --git a/codegraph/cli/commands_hooks.py b/codegraph/cli/commands_hooks.py index 55b6767..9c14fc6 100644 --- a/codegraph/cli/commands_hooks.py +++ b/codegraph/cli/commands_hooks.py @@ -8,6 +8,8 @@ from __future__ import annotations +import argparse + import json import os import re @@ -26,7 +28,7 @@ _MIN_SYMBOLS_FOR_OUTLINE_HINT = 5 -def cmd_hook_precheck_grep(args) -> None: +def cmd_hook_precheck_grep(args: argparse.Namespace) -> None: """ PreToolUse hook for Grep. Reads the hook payload from stdin, and when the pattern looks like a bare identifier prints a suggestion to stderr @@ -56,7 +58,7 @@ def cmd_hook_precheck_grep(args) -> None: sys.exit(0) -def cmd_hook_precheck_read(args) -> None: +def cmd_hook_precheck_read(args: argparse.Namespace) -> None: """ PreToolUse hook for Read. 
When the file is indexed in cgh's FTS and the Read is a full read (no offset/limit), suggest file_outline / symbols_in_file diff --git a/codegraph/cli/commands_impact.py b/codegraph/cli/commands_impact.py new file mode 100644 index 0000000..538385f --- /dev/null +++ b/codegraph/cli/commands_impact.py @@ -0,0 +1,278 @@ +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# __creation__ = 2026-06-07 +# __author__ = "jndjama (Joy Ndjama)" +# __copyright__ = "Copyright 2026 ALTIKVA." +# __licence__ = "MIT & CC BY-NC-SA (http://www.altikva.com/licenses/LICENSE-1.0)" +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# Description: `cgh impact --since ` CI command for PR bots. Diffs HEAD +# against its merge base with a git ref, then reads the graph read-only +# to report changed symbols, the IMPORTS blast radius grouped by +# role / layer, endpoints touched, and tests to run. Emits JSON +# (machine-parseable on stdout) or a markdown PR-comment summary. +# Runs without an MCP owner: opens the graph DB read-only and +# degrades gracefully when the index is missing or stale. + +from __future__ import annotations + +import argparse +import json +import os +import subprocess +import sys +from pathlib import Path + +from rich.console import Console + +from codegraph.cli import LOGO + +# Banner + notes go to stderr so stdout stays a clean JSON / markdown stream +# that a PR bot can pipe and parse. +_err = Console(stderr=True) + + +def _git_changed_files(root: str, since: str) -> tuple[list[str], str | None]: + """Return (changed_files, error). Diffs ``since...HEAD`` (merge base to HEAD). + + Mirrors the validation in tools_index.index_changed_files: a leading dash + is rejected so a value like "--output=/x" cannot be read as a git flag, + and the trailing "--" keeps the ref from being parsed as a pathspec.
+ """ + if since.startswith("-"): + return [], f"invalid git ref: {since!r}" + cmd = [ + "git", + "diff", + "--name-only", + "--diff-filter=ACMR", + f"{since}...", + "--", + ] + try: + result = subprocess.run( + cmd, + capture_output=True, + text=True, + encoding="utf-8", + errors="replace", + cwd=root, + timeout=30, + ) + except Exception as exc: + return [], f"git diff failed: {exc}" + if result.returncode != 0: + msg = (result.stderr or "").strip() or f"git diff exited {result.returncode}" + return [], f"git diff failed: {msg}" + files = [ + f.strip() + for f in result.stdout.strip().splitlines() + if f.strip() and not f.strip().startswith(".codegraph/") + ] + return files, None + + +def _build_report(conn, root: str, changed_files: list[str]) -> dict: + """Assemble the impact report from the graph for the changed files. + + Uses the shared analysis helpers so the CLI and the MCP tools stay in + lockstep. All paths returned to the caller are repo-relative. + """ + from codegraph.analysis import impact as _impact + + root_path = Path(root).resolve() + + def _rel(p: str) -> str: + try: + return str(Path(p).resolve().relative_to(root_path)) + except (ValueError, OSError): + return p + + # Changed files resolve to absolute File-node keys for graph lookups. + abs_changed = [str((root_path / f)) for f in changed_files] + + changed_symbols: list[dict] = [] + for abs_f, rel_f in zip(abs_changed, changed_files): + for sym in _impact.symbols_in_file(conn, abs_f): + changed_symbols.append({"file": rel_f, **sym}) + + # Blast radius: files that transitively import any changed file. 
+ radius, radius_trunc = _impact.reverse_import_bfs(conn, abs_changed, max_depth=3) + + impacted: list[dict] = [] + by_role: dict[str, int] = {} + by_layer: dict[str, int] = {} + for abs_p in radius: + role, layer = _impact.file_role(conn, abs_p) + impacted.append({"file": _rel(abs_p), "role": role, "layer": layer}) + if role: + by_role[role] = by_role.get(role, 0) + 1 + if layer: + by_layer[layer] = by_layer.get(layer, 0) + 1 + + # Endpoints declared in the changed files OR any impacted file. + endpoint_scope = abs_changed + radius + endpoints = [ + {"file": _rel(e["file"]), "method": e["method"], "path": e["path"]} + for e in _impact.endpoints_in_files(conn, endpoint_scope) + ] + + # Tests to run: for each changed file, the test files that exercise it. + test_seen: set[str] = set() + tests: list[dict] = [] + for abs_f in abs_changed: + for t in _impact.tests_for_file(conn, abs_f): + rel_t = _rel(t["file"]) + if rel_t in test_seen: + continue + test_seen.add(rel_t) + tests.append({"file": rel_t, "role": t["role"]}) + + return { + "since_changed": changed_files, + "changed_symbols": changed_symbols, + "impacted": impacted, + "impacted_count": len(impacted), + "impacted_by_role": by_role, + "impacted_by_layer": by_layer, + "endpoints": endpoints, + "tests_to_run": tests, + "truncated": radius_trunc, + "note": ( + "Blast radius and tests are inferred from IMPORTS / CALLS edges, " + "not a coverage run. Keep the index fresh with `cgh index` in CI." 
+ ), + } + + +def _render_markdown(report: dict, since: str) -> str: + """Render the report as a PR-comment-friendly markdown summary.""" + lines: list[str] = [] + lines.append(f"## cgh impact (since `{since}`)") + lines.append("") + + changed = report["since_changed"] + lines.append(f"**Changed files ({len(changed)})**") + if changed: + for f in changed: + lines.append(f"- `{f}`") + else: + lines.append("- _none_") + lines.append("") + + impacted = report["impacted"] + lines.append(f"**Impacted files ({report['impacted_count']})**") + if impacted: + # Group by layer for a compact read. + by_layer: dict[str, list[dict]] = {} + for row in impacted: + by_layer.setdefault(row.get("layer") or "other", []).append(row) + for layer in sorted(by_layer): + rows = by_layer[layer] + lines.append(f"- _{layer}_ ({len(rows)})") + for row in rows[:25]: + role = row.get("role") or "" + suffix = f" `{role}`" if role else "" + lines.append(f" - `{row['file']}`{suffix}") + if len(rows) > 25: + lines.append(f" - _... {len(rows) - 25} more_") + else: + lines.append("- _none_") + lines.append("") + + endpoints = report["endpoints"] + lines.append(f"**Endpoints touched ({len(endpoints)})**") + if endpoints: + for e in endpoints: + method = e.get("method") or "?" + lines.append(f"- `{method} {e.get('path', '')}` ({e['file']})") + else: + lines.append("- _none_") + lines.append("") + + tests = report["tests_to_run"] + lines.append(f"**Tests to run ({len(tests)})**") + if tests: + for t in tests: + lines.append(f"- `{t['file']}`") + else: + lines.append("- _no importing tests found_") + lines.append("") + + if report.get("truncated"): + lines.append("> Note: blast radius was truncated (large graph).") + lines.append("") + lines.append(f"> {report['note']}") + return "\n".join(lines) + + +def cmd_impact(args: argparse.Namespace) -> None: + """Handler for `cgh impact`. 
Non-MCP, CI-oriented: diffs against a ref, + reads the graph read-only, and emits JSON or markdown.""" + root = os.path.abspath(args.root) + since = getattr(args, "since", "HEAD~1") or "HEAD~1" + + # --json is shorthand for --format json; default format is markdown. + fmt = getattr(args, "format", "md") or "md" + if getattr(args, "json", False): + fmt = "json" + want_json = fmt == "json" + + # Banner to stderr only, never pollute the JSON / markdown on stdout. + _err.print(LOGO) + _err.print( + "[dim]impact: diffing against " + f"[/dim][cyan]{since}[/cyan][dim], reading graph read-only. " + "Keep the index fresh with [/dim][cyan]cgh index[/cyan][dim] in CI.[/dim]\n" + ) + + if not (Path(root) / ".codegraph").is_dir(): + _fail( + want_json, + "repo is not indexed by cgh (.codegraph/ missing). " + "Run `cgh init` then `cgh index`.", + ) + return + + changed, err = _git_changed_files(root, since) + if err is not None: + _fail(want_json, err) + return + + # Open the graph read-only directly, no MCP owner required. When an owner + # holds the write lock, get_readonly_connection returns None; tell the + # caller clearly rather than emitting a misleading empty report. + from codegraph.core.db import get_readonly_connection + + conn = None + try: + conn = get_readonly_connection(root) + except Exception as exc: + _fail(want_json, f"could not open graph read-only: {exc}") + return + + if conn is None: + _fail( + want_json, + "graph DB is locked (an MCP owner is running) or missing. " + "Stop the owner with `cgh serve --stop`, or run this in CI where " + "no owner is alive.", + ) + return + + report = _build_report(conn, root, changed) + report["since"] = since + + if want_json: + # Clean machine-parseable stdout. + print(json.dumps(report, indent=2)) + else: + print(_render_markdown(report, since)) + + +def _fail(want_json: bool, message: str) -> None: + """Emit a graceful error. 
JSON mode keeps stdout parseable with an + {"error": ...} object; markdown mode writes the note to stderr.""" + if want_json: + print(json.dumps({"error": message}, indent=2)) + else: + _err.print(f"[yellow]{message}[/yellow]") + sys.exit(1) diff --git a/codegraph/cli/commands_index.py b/codegraph/cli/commands_index.py index ef9795a..a8e13bf 100644 --- a/codegraph/cli/commands_index.py +++ b/codegraph/cli/commands_index.py @@ -8,13 +8,22 @@ from __future__ import annotations +import argparse + import os import sys from pathlib import Path from rich import box from rich.panel import Panel -from rich.progress import BarColumn, MofNCompleteColumn, Progress, SpinnerColumn, TextColumn, TimeElapsedColumn +from rich.progress import ( + BarColumn, + MofNCompleteColumn, + Progress, + SpinnerColumn, + TextColumn, + TimeElapsedColumn, +) from rich.table import Table from codegraph.cli import LOGO, _lang_color, _short_path, console @@ -24,7 +33,7 @@ # --------------------------------------------------------------------------- -def cmd_index(args) -> None: +def cmd_index(args: argparse.Namespace) -> None: from codegraph.indexer import index_repo from codegraph.state.ipc import is_owner_alive, read_owner_port @@ -139,7 +148,10 @@ def _print_index_summary(stats: dict) -> None: table.add_column("Value", justify="right") indexed = stats.get("indexed", stats.get("reindexed_count", 0)) skipped = stats.get("skipped", stats.get("unchanged_count", 0)) - deleted = stats.get("deleted_count", len(stats.get("deleted", [])) if isinstance(stats.get("deleted"), list) else 0) + deleted = stats.get( + "deleted_count", + len(stats.get("deleted", [])) if isinstance(stats.get("deleted"), list) else 0, + ) method = stats.get("method", stats.get("mode", "?")) table.add_row("Files indexed", f"[green]{indexed}[/green]") table.add_row("Files skipped", f"[dim]{skipped}[/dim]") @@ -158,7 +170,7 @@ def _print_index_summary(stats: dict) -> None: # 
--------------------------------------------------------------------------- -def cmd_memory_index(args) -> None: +def cmd_memory_index(args: argparse.Namespace) -> None: """Scan the Claude Code memory directory into the FTS index.""" from codegraph.claude_state.memory import scan_memory_dir @@ -183,7 +195,7 @@ def cmd_memory_index(args) -> None: # --------------------------------------------------------------------------- -def cmd_plan_index(args) -> None: +def cmd_plan_index(args: argparse.Namespace) -> None: """Scan ~/.claude/plans/ into the FTS index.""" from codegraph.claude_state.plans import scan_plan_dir @@ -208,7 +220,7 @@ def cmd_plan_index(args) -> None: # --------------------------------------------------------------------------- -def cmd_watch(args) -> None: +def cmd_watch(args: argparse.Namespace) -> None: from codegraph.indexer import index_repo from codegraph.state.watcher import watch_forever @@ -220,7 +232,9 @@ def cmd_watch(args) -> None: with console.status("[bold blue]Initial index...", spinner="dots"): stats = index_repo(root, verbose=False) - console.print(f"[green]Initial index done[/green] -- {stats['indexed']} files in {stats['elapsed_s']}s") + console.print( + f"[green]Initial index done[/green] -- {stats['indexed']} files in {stats['elapsed_s']}s" + ) console.print("[dim]Watching for changes... 
(Ctrl-C to stop)[/dim]\n") watch_forever(root) @@ -230,7 +244,7 @@ def cmd_watch(args) -> None: # --------------------------------------------------------------------------- -def cmd_serve(args) -> None: +def cmd_serve(args: argparse.Namespace) -> None: root = os.path.abspath(args.root) # --background: spawn/reuse the owner, drop a persistent keepalive @@ -258,7 +272,9 @@ def cmd_serve(args) -> None: reindex=getattr(args, "reindex", False), ) if port is None: - console.print("[red]Failed to start owner (see .codegraph/owner.log)[/red]") + console.print( + "[red]Failed to start owner (see .codegraph/owner.log)[/red]" + ) return console.print(f"[green]Owner started on port {port}[/green]") @@ -299,7 +315,9 @@ def cmd_serve(args) -> None: terminate(pid, graceful_timeout=5.0) if is_pid_alive(pid): - console.print(f"[yellow]Owner (pid {pid}) force-killed after timeout.[/yellow]") + console.print( + f"[yellow]Owner (pid {pid}) force-killed after timeout.[/yellow]" + ) else: console.print(f"[green]Owner (pid {pid}) stopped.[/green]") # Belt-and-braces: drop stale ipc files if the owner crashed @@ -333,7 +351,7 @@ def cmd_serve(args) -> None: # --------------------------------------------------------------------------- -def cmd_force_index(args) -> None: +def cmd_force_index(args: argparse.Namespace) -> None: from codegraph.indexer import _PARSERS, index_file root = Path(os.path.abspath(args.root)) @@ -348,7 +366,10 @@ def cmd_force_index(args) -> None: border_style="yellow", ) ) - if console.input("[yellow]Continue? [y/N][/yellow] ").strip().lower() not in ("y", "yes"): + if console.input("[yellow]Continue? 
[y/N][/yellow] ").strip().lower() not in ( + "y", + "yes", + ): console.print("[dim]Aborted.[/dim]") return @@ -360,7 +381,9 @@ def cmd_force_index(args) -> None: ok = index_file(target, root, force=True) if ok: indexed += 1 - status.update(f"[bold yellow]Indexed:[/bold yellow] {target.relative_to(root)}") + status.update( + f"[bold yellow]Indexed:[/bold yellow] {target.relative_to(root)}" + ) elif target.is_dir(): for dirpath, _, filenames in os.walk(target): for filename in filenames: @@ -369,14 +392,16 @@ def cmd_force_index(args) -> None: ok = index_file(full, root, force=True) if ok: indexed += 1 - status.update(f"[bold yellow]Indexed:[/bold yellow] {full.relative_to(root)}") + status.update( + f"[bold yellow]Indexed:[/bold yellow] {full.relative_to(root)}" + ) else: console.print(f" [red]x[/red] {p} (not found)") console.print(f"[green]Force-indexed {indexed} file(s)[/green]") -def cmd_reindex_hook(args) -> None: +def cmd_reindex_hook(args: argparse.Namespace) -> None: """Hidden command invoked by the cgh git hooks (post-merge, post-checkout, post-rewrite). Quietly runs an incremental reindex so the graph tracks the content brought in by a pull, merge, branch switch, or rebase. 
@@ -395,7 +420,9 @@ def cmd_reindex_hook(args) -> None: if is_owner_alive(root): port = read_owner_port(root) if port: - from codegraph.cli.commands_monitor import _ask_owner_incremental_reindex + from codegraph.cli.commands_monitor import ( + _ask_owner_incremental_reindex, + ) _ask_owner_incremental_reindex(root, port) return diff --git a/codegraph/cli/commands_init.py b/codegraph/cli/commands_init.py index 88648cf..c243b1a 100644 --- a/codegraph/cli/commands_init.py +++ b/codegraph/cli/commands_init.py @@ -49,7 +49,11 @@ def _configure_claude_auto_accept(root: Path) -> list[str]: allow = data.setdefault("permissions", {}).setdefault("allow", []) # Drop redundant per-tool entries - allow[:] = [item for item in allow if not (item.startswith("mcp__codegraph__") and item != "mcp__codegraph__*")] + allow[:] = [ + item + for item in allow + if not (item.startswith("mcp__codegraph__") and item != "mcp__codegraph__*") + ] # Add wildcards if missing added: list[str] = [] @@ -147,7 +151,11 @@ def _render() -> Table: for ts, event, detail in entries: age = now - ts when = f"{int(age)}s ago" if age < 60 else f"{int(age / 60)}m ago" - style = "green" if event.endswith("_end") else ("yellow" if "error" in event else "cyan") + style = ( + "green" + if event.endswith("_end") + else ("yellow" if "error" in event else "cyan") + ) tbl.add_row(when, f"[{style}]{event}[/{style}]", detail) return tbl @@ -157,7 +165,9 @@ def _render() -> Table: _t.sleep(0.5) live.update(_render()) except KeyboardInterrupt: - console_obj.print("\n [yellow]stopped watching, owner may still be working in the background[/yellow]") + console_obj.print( + "\n [yellow]stopped watching, owner may still be working in the background[/yellow]" + ) return # Report result @@ -168,7 +178,9 @@ def _render() -> Table: ) return if result_holder["status"] != 200: - console_obj.print(f" [yellow]owner returned HTTP {result_holder['status']}[/yellow]") + console_obj.print( + f" [yellow]owner returned HTTP 
{result_holder['status']}[/yellow]" + ) return # Try to pull the JSON stats out of the MCP response @@ -314,7 +326,9 @@ def _detect_existing_state(root: Path) -> dict: cfg = cg_dir / "config.toml" if cfg.exists(): with open(cfg, "rb") as f: - state["extra_dirs"] = tomllib.load(f).get("codegraph", {}).get("extra_dirs", []) + state["extra_dirs"] = ( + tomllib.load(f).get("codegraph", {}).get("extra_dirs", []) + ) except Exception: pass @@ -330,7 +344,9 @@ def _detect_existing_state(root: Path) -> dict: if fp.exists(): try: content = fp.read_text(encoding="utf-8", errors="replace") - state["agent_blocks"][tool_key] = marker in content or tool_key == "cursor" + state["agent_blocks"][tool_key] = ( + marker in content or tool_key == "cursor" + ) except OSError: state["agent_blocks"][tool_key] = False @@ -341,7 +357,9 @@ def _detect_existing_state(root: Path) -> dict: if skills_dir.exists(): try: state["claude_skills_installed"] = sorted( - d.name for d in skills_dir.iterdir() if d.is_dir() and d.name.startswith("cgh-") + d.name + for d in skills_dir.iterdir() + if d.is_dir() and d.name.startswith("cgh-") ) except OSError: pass @@ -359,7 +377,9 @@ def _detect_existing_state(root: Path) -> dict: import json as _json data = _json.loads(mcp_path.read_text(encoding="utf-8")) - state["mcp_server_configured"] = "codegraph" in (data.get("mcpServers") or {}) + state["mcp_server_configured"] = "codegraph" in ( + data.get("mcpServers") or {} + ) except Exception: pass @@ -436,9 +456,7 @@ def _auto_migrate_kuzu_to_duckdb(root: Path) -> None: f" [yellow]Counts differ between Kuzu ({result.kuzu_nodes:,} nodes) " f"and DuckDB ({result.duckdb_nodes:,} nodes). 
Kept both files.[/yellow]" ) - console.print( - f" [dim]{result.message}[/dim]" - ) + console.print(f" [dim]{result.message}[/dim]") console.print( " [dim]Inspect with [cyan]cgh status[/cyan], then " "[cyan]cgh migrate-to-duckdb --force[/cyan] to retry.[/dim]\n" @@ -471,105 +489,81 @@ def _install_git_reindex_hooks(root: Path) -> None: # --------------------------------------------------------------------------- -# cmd_init +# cmd_init helpers # --------------------------------------------------------------------------- -def cmd_init(args) -> None: - import glob - import shutil - - import questionary - from questionary import Style - - from codegraph.core.config import init_project - - root = Path(os.path.abspath(args.root)) - console.print(LOGO) - console.print(f" [dim]Project:[/dim] [bold]{root}[/bold]\n") - - cg_style = Style( - [ - ("qmark", "fg:cyan bold"), - ("question", "fg:white bold"), - ("answer", "fg:green bold"), - ("pointer", "fg:cyan bold"), - ("highlighted", "fg:cyan bold"), - ("selected", "fg:green"), - ("separator", "fg:cyan"), - ("instruction", "fg:white dim"), - ("text", "fg:white"), - ] - ) - - # -- Step 0: Probe existing state (before anything mutates disk) -- - prior_state = _detect_existing_state(root) - if prior_state["initialized"]: - console.print(" [bold]Existing codegraph state detected:[/bold]\n") - bits: list[str] = [] - if prior_state["indexed_files"] > 0: - bits.append(f"{prior_state['indexed_files']:,} files indexed") - if prior_state["graph_db_bytes"] > 0: - bits.append(f"graph.db {prior_state['graph_db_bytes'] // 1024} KB") - if prior_state["owner_alive"]: - bits.append( - f"[green]owner running[/green] (pid {prior_state['owner_pid']} port {prior_state['owner_port']})" - ) - elif prior_state["owner_pid"]: - bits.append(f"[yellow]stale owner.pid {prior_state['owner_pid']}[/yellow]") - if prior_state["scan_meta"] and prior_state["scan_meta"].get("git_head"): - sha = prior_state["scan_meta"]["git_head"][:8] - branch = 
prior_state["scan_meta"].get("git_branch") or "?" - bits.append(f"last scan at {sha} on {branch}") - if prior_state["extra_dirs"]: - bits.append(f"{len(prior_state['extra_dirs'])} extra_dirs") - if prior_state["mcp_server_configured"]: - bits.append(".mcp.json already has codegraph") - if prior_state.get("claude_skills_installed"): - n = len(prior_state["claude_skills_installed"]) - mod = len(prior_state.get("claude_skills_modified") or []) - label = f"{n} claude skill{'s' if n != 1 else ''}" - if mod: - label += f" ([yellow]{mod} modified locally[/yellow])" - bits.append(label) - blocks_present = [k for k, v in prior_state["agent_blocks"].items() if v] - if blocks_present: - bits.append("agent blocks: " + ", ".join(blocks_present)) - for b in bits: - console.print(f" • {b}") - if not bits: - console.print(" [dim](initialized but empty, safe to full scan)[/dim]") - console.print() - - # -- Auto-migrate Kuzu -> DuckDB before anything else touches the DB -- - _auto_migrate_kuzu_to_duckdb(root) - - # -- Step 1: Create .codegraph/ -- - with console.status("[bold cyan]Setting up codegraph...", spinner="dots"): - result = init_project(root) - - if result["created"]: - for f in result["created"]: - console.print(f" [green]+[/green] {f}") - else: - console.print(" [dim].codegraph/ already exists[/dim]") - +def _print_prior_state(prior_state: dict) -> None: + """Render the 'existing codegraph state detected' summary for an + already-initialized repo. 
No-op when the repo is fresh.""" + if not prior_state["initialized"]: + return + console.print(" [bold]Existing codegraph state detected:[/bold]\n") + bits: list[str] = [] + if prior_state["indexed_files"] > 0: + bits.append(f"{prior_state['indexed_files']:,} files indexed") + if prior_state["graph_db_bytes"] > 0: + bits.append(f"graph.db {prior_state['graph_db_bytes'] // 1024} KB") + if prior_state["owner_alive"]: + bits.append( + f"[green]owner running[/green] (pid {prior_state['owner_pid']} port {prior_state['owner_port']})" + ) + elif prior_state["owner_pid"]: + bits.append(f"[yellow]stale owner.pid {prior_state['owner_pid']}[/yellow]") + if prior_state["scan_meta"] and prior_state["scan_meta"].get("git_head"): + sha = prior_state["scan_meta"]["git_head"][:8] + branch = prior_state["scan_meta"].get("git_branch") or "?" + bits.append(f"last scan at {sha} on {branch}") + if prior_state["extra_dirs"]: + bits.append(f"{len(prior_state['extra_dirs'])} extra_dirs") + if prior_state["mcp_server_configured"]: + bits.append(".mcp.json already has codegraph") + if prior_state.get("claude_skills_installed"): + n = len(prior_state["claude_skills_installed"]) + mod = len(prior_state.get("claude_skills_modified") or []) + label = f"{n} claude skill{'s' if n != 1 else ''}" + if mod: + label += f" ([yellow]{mod} modified locally[/yellow])" + bits.append(label) + blocks_present = [k for k, v in prior_state["agent_blocks"].items() if v] + if blocks_present: + bits.append("agent blocks: " + ", ".join(blocks_present)) + for b in bits: + console.print(f" • {b}") + if not bits: + console.print(" [dim](initialized but empty, safe to full scan)[/dim]") console.print() - # -- Step 1b: git hooks that keep the graph fresh after pull/merge/checkout -- - _install_git_reindex_hooks(root) - # -- Step 2: Detect AI tools -- +def _detect_ai_tools(root: Path) -> list[tuple[str, str, bool]]: + """Probe for installed AI tools, print the detection table, and return + the full (name, key, detected) 
list.""" + import shutil + console.print(" [bold]Detecting AI tools...[/bold]\n") all_tools = [ - ("Claude Code", "claude", (root / ".claude").exists() or shutil.which("claude") is not None), - ("Cursor", "cursor", (root / ".cursor").exists() or (root / ".cursorrules").exists()), - ("Codex CLI", "codex", (root / "AGENTS.md").exists() or shutil.which("codex") is not None), + ( + "Claude Code", + "claude", + (root / ".claude").exists() or shutil.which("claude") is not None, + ), + ( + "Cursor", + "cursor", + (root / ".cursor").exists() or (root / ".cursorrules").exists(), + ), + ( + "Codex CLI", + "codex", + (root / "AGENTS.md").exists() or shutil.which("codex") is not None, + ), ( "Gemini CLI", "gemini", - (root / "GEMINI.md").exists() or (root / ".gemini").exists() or shutil.which("gemini") is not None, + (root / "GEMINI.md").exists() + or (root / ".gemini").exists() + or shutil.which("gemini") is not None, ), ] @@ -579,9 +573,20 @@ def cmd_init(args) -> None: console.print(f" {icon} {name:15s} {status}") console.print() - detected_tools = [(name, key) for name, key, detected in all_tools if detected] + return all_tools + + +def _select_tools( + all_tools: list[tuple[str, str, bool]], + detected_tools: list[tuple[str, str]], + args: argparse.Namespace, + cg_style, +) -> list[str]: + """Run the interactive multi-select (or auto-pick under --yes) for which + tools to configure, then print the skipped list. Returns the selected + tool keys.""" + import questionary - # -- Step 3: Select which tools to configure -- selected_keys = [] if detected_tools and not args.yes: choices = [ @@ -624,6 +629,164 @@ def cmd_init(args) -> None: + ", ".join(f"[dim]{k}[/dim]" for k in skipped) + " [dim](no config, no agent block, no skills)[/dim]\n" ) + return selected_keys + + +def _count_parseable_files(root: Path) -> dict[str, int]: + """Count files the indexer will actually process, by language. Uses + git ls-files (respects .gitignore) and falls back to glob for a + non-git repo. 
Subrepo paths federated earlier in `cmd_init` (step 3d) are skipped.""" + import glob + + from codegraph.analysis.federation import child_paths_to_skip, is_under_any + from codegraph.parsers import get_parser_info + + parsers = get_parser_info() + ext_to_lang = {ext: info["lang"] for info in parsers for ext in info["extensions"]} + + # Federation skip list: if the user federated subrepos in step 3d of + # cmd_init, they should NOT contribute to the file count. + skip_paths = child_paths_to_skip(root) + + file_counts: dict[str, int] = {} + try: + import subprocess + + result = subprocess.run( + ["git", "ls-files", "--cached", "--others", "--exclude-standard"], + capture_output=True, + text=True, + encoding="utf-8", + errors="replace", + cwd=str(root), + timeout=30, + ) + if result.returncode == 0: + for line in result.stdout.splitlines(): + if not line: + continue + if skip_paths and is_under_any(root / line, skip_paths): + continue + suffix = Path(line).suffix.lower() + lang = ext_to_lang.get(suffix) + if lang: + file_counts[lang] = file_counts.get(lang, 0) + 1 + else: + raise RuntimeError("git ls-files failed") + except (subprocess.TimeoutExpired, FileNotFoundError, RuntimeError, OSError): + # Fallback, glob from project root. Filter out subrepo paths so + # the count reflects what the indexer will actually process. 
+ for info in parsers: + count = 0 + for ext in info["extensions"]: + for match in glob.glob( + f"**/*{ext}", root_dir=str(root), recursive=True + ): + full = root / match + if skip_paths and is_under_any(full, skip_paths): + continue + count += 1 + if count > 0: + file_counts[info["lang"]] = count + + return file_counts + + +def _print_file_counts(file_counts: dict[str, int]) -> None: + """Render the per-language 'Files to index' bar chart.""" + if not file_counts: + return + console.print(" [bold]Files to index:[/bold]\n") + lang_colors = { + "python": "green", + "typescript": "blue", + "javascript": "yellow", + "terraform": "magenta", + "markdown": "cyan", + "vue": "green", + "nuxt_config": "green", + } + for lang, count in sorted(file_counts.items(), key=lambda x: -x[1]): + color = lang_colors.get(lang, "white") + bar_len = min(count // 5, 30) or 1 + bar = f"[{color}]{'>' * bar_len}[/{color}]" + console.print(f" [{color}]{lang:12s}[/{color}] {count:>5d} files {bar}") + console.print() + + +def _print_init_summary() -> None: + """Render the closing 'codegraph is ready!' 
panel.""" + console.print() + console.print( + Panel( + "[bold]codegraph is ready![/bold]\n\n" + " [cyan]cgh stats[/cyan] View graph statistics\n" + " [cyan]cgh search X[/cyan] Find symbols\n" + " [cyan]cgh serve[/cyan] Start MCP server\n" + " [cyan]cgh parsers[/cyan] List supported languages\n" + " [cyan]cgh --help[/cyan] All commands", + border_style="green", + ) + ) + + +# --------------------------------------------------------------------------- +# cmd_init +# --------------------------------------------------------------------------- + + +def cmd_init(args: argparse.Namespace) -> None: + import questionary + from questionary import Style + + from codegraph.core.config import init_project + + root = Path(os.path.abspath(args.root)) + console.print(LOGO) + console.print(f" [dim]Project:[/dim] [bold]{root}[/bold]\n") + + cg_style = Style( + [ + ("qmark", "fg:cyan bold"), + ("question", "fg:white bold"), + ("answer", "fg:green bold"), + ("pointer", "fg:cyan bold"), + ("highlighted", "fg:cyan bold"), + ("selected", "fg:green"), + ("separator", "fg:cyan"), + ("instruction", "fg:white dim"), + ("text", "fg:white"), + ] + ) + + # -- Step 0: Probe existing state (before anything mutates disk) -- + prior_state = _detect_existing_state(root) + _print_prior_state(prior_state) + + # -- Auto-migrate Kuzu -> DuckDB before anything else touches the DB -- + _auto_migrate_kuzu_to_duckdb(root) + + # -- Step 1: Create .codegraph/ -- + with console.status("[bold cyan]Setting up codegraph...", spinner="dots"): + result = init_project(root) + + if result["created"]: + for f in result["created"]: + console.print(f" [green]+[/green] {f}") + else: + console.print(" [dim].codegraph/ already exists[/dim]") + + console.print() + + # -- Step 1b: git hooks that keep the graph fresh after pull/merge/checkout -- + _install_git_reindex_hooks(root) + + # -- Step 2: Detect AI tools -- + all_tools = _detect_ai_tools(root) + detected_tools = [(name, key) for name, key, detected in all_tools if 
detected] + + # -- Step 3: Select which tools to configure -- + selected_keys = _select_tools(all_tools, detected_tools, args, cg_style) # Check for locally-edited skills before overwriting overwrite_skills = True @@ -632,16 +795,22 @@ def cmd_init(args) -> None: modified = detect_modified_skills(root) if modified and not args.yes: - console.print(" [yellow]You have locally-edited skills in .claude/skills/:[/yellow]") + console.print( + " [yellow]You have locally-edited skills in .claude/skills/:[/yellow]" + ) for m in modified: - console.print(f" • [yellow]{m}[/yellow] (SKILL.md differs from bundled)") + console.print( + f" • [yellow]{m}[/yellow] (SKILL.md differs from bundled)" + ) overwrite_skills = questionary.confirm( "Overwrite with the bundled versions? (Your edits will be lost)", default=False, style=cg_style, ).ask() if not overwrite_skills: - console.print(" [green]Keeping your edits[/green], will refresh only new / unmodified skills.\n") + console.print( + " [green]Keeping your edits[/green], will refresh only new / unmodified skills.\n" + ) for key in selected_keys: _install_integration(root, key, overwrite_skills=overwrite_skills) @@ -667,7 +836,9 @@ def cmd_init(args) -> None: f"[dim](permissions.allow += {', '.join(added)})[/dim]" ) else: - console.print(" [dim]•[/dim] .claude/settings.local.json [dim](already allows codegraph)[/dim]") + console.print( + " [dim]•[/dim] .claude/settings.local.json [dim](already allows codegraph)[/dim]" + ) console.print() # -- Step 3c: offer to inject codegraph usage guidelines into agent rules -- @@ -680,7 +851,9 @@ def cmd_init(args) -> None: "gemini": "GEMINI.md", "cursor": ".cursor/rules/codegraph-usage.mdc", } - inject_targets = [(k, target_files[k]) for k in selected_keys if k in target_files] + inject_targets = [ + (k, target_files[k]) for k in selected_keys if k in target_files + ] if inject_targets: targets_str = ", ".join(f for _, f in inject_targets) inject = ( @@ -697,7 +870,9 @@ def cmd_init(args) -> 
None: written = install_usage_guidelines(root, tool_key) if written: rel = written.replace(str(root) + "/", "") - console.print(f" [green]+[/green] {rel} [dim](codegraph usage block)[/dim]") + console.print( + f" [green]+[/green] {rel} [dim](codegraph usage block)[/dim]" + ) console.print() # -- Step 3d: Detect already-initialized subrepos -- @@ -742,75 +917,13 @@ def cmd_init(args) -> None: console.print(f" [yellow]⚠[/yellow] {s.name}: {exc}") console.print() else: - console.print(" [dim]Skipped. To federate later: [cyan]cgh federate add [/cyan][/dim]\n") + console.print( + " [dim]Skipped. To federate later: [cyan]cgh federate add [/cyan][/dim]\n" + ) # -- Step 4: Detect parseable files -- - # Use git ls-files to match what the real indexer will process - # (respects .gitignore). Fall back to glob if not a git repo. - from codegraph.analysis.federation import child_paths_to_skip, is_under_any - from codegraph.parsers import get_parser_info - - parsers = get_parser_info() - ext_to_lang = {ext: info["lang"] for info in parsers for ext in info["extensions"]} - - # Federation skip list, if the user federated subrepos in step 3d above, - # they should NOT contribute to the file count. - skip_paths = child_paths_to_skip(root) - - file_counts: dict[str, int] = {} - try: - import subprocess - - result = subprocess.run( - ["git", "ls-files", "--cached", "--others", "--exclude-standard"], - capture_output=True, - text=True, encoding="utf-8", errors="replace", - cwd=str(root), - timeout=30, - ) - if result.returncode == 0: - for line in result.stdout.splitlines(): - if not line: - continue - if skip_paths and is_under_any(root / line, skip_paths): - continue - suffix = Path(line).suffix.lower() - lang = ext_to_lang.get(suffix) - if lang: - file_counts[lang] = file_counts.get(lang, 0) + 1 - else: - raise RuntimeError("git ls-files failed") - except (subprocess.TimeoutExpired, FileNotFoundError, RuntimeError, OSError): - # Fallback, glob from project root. 
Filter out subrepo paths so - # the count reflects what the indexer will actually process. - for info in parsers: - count = 0 - for ext in info["extensions"]: - for match in glob.glob(f"**/*{ext}", root_dir=str(root), recursive=True): - full = root / match - if skip_paths and is_under_any(full, skip_paths): - continue - count += 1 - if count > 0: - file_counts[info["lang"]] = count - - if file_counts: - console.print(" [bold]Files to index:[/bold]\n") - lang_colors = { - "python": "green", - "typescript": "blue", - "javascript": "yellow", - "terraform": "magenta", - "markdown": "cyan", - "vue": "green", - "nuxt_config": "green", - } - for lang, count in sorted(file_counts.items(), key=lambda x: -x[1]): - color = lang_colors.get(lang, "white") - bar_len = min(count // 5, 30) or 1 - bar = f"[{color}]{'>' * bar_len}[/{color}]" - console.print(f" [{color}]{lang:12s}[/{color}] {count:>5d} files {bar}") - console.print() + file_counts = _count_parseable_files(root) + _print_file_counts(file_counts) total = sum(file_counts.values()) @@ -822,13 +935,17 @@ def cmd_init(args) -> None: choice = None if owner_alive: - console.print(" [yellow]The MCP owner is running, it already watches this repo.[/yellow]") + console.print( + " [yellow]The MCP owner is running, it already watches this repo.[/yellow]" + ) if not args.yes: choice = ( questionary.select( "What do you want to do?", choices=[ - questionary.Choice(title="Skip, owner keeps the index fresh", value="skip"), + questionary.Choice( + title="Skip, owner keeps the index fresh", value="skip" + ), questionary.Choice( title="Incremental rescan through the owner (no lock fight)", value="mcp_scan", @@ -854,7 +971,9 @@ def cmd_init(args) -> None: title="Incremental (only files changed since last scan)", value="incremental", ), - questionary.Choice(title="Full scan (re-parse everything)", value="full"), + questionary.Choice( + title="Full scan (re-parse everything)", value="full" + ), questionary.Choice(title="Skip, keep as-is", 
value="skip"), ], style=cg_style, @@ -878,11 +997,15 @@ def cmd_init(args) -> None: if choice == "full": from codegraph.cli.commands_index import cmd_index - cmd_index(argparse.Namespace(root=str(root), verbose=False, method=default_method)) + cmd_index( + argparse.Namespace(root=str(root), verbose=False, method=default_method) + ) elif choice == "incremental": from codegraph.cli.commands_index import cmd_index - cmd_index(argparse.Namespace(root=str(root), verbose=False, method="incremental")) + cmd_index( + argparse.Namespace(root=str(root), verbose=False, method="incremental") + ) elif choice == "mcp_scan": _incremental_via_owner( root=root, @@ -892,25 +1015,20 @@ def cmd_init(args) -> None: elif choice == "reset": from codegraph.cli.commands_monitor import cmd_reset - cmd_reset(argparse.Namespace(root=str(root), yes=True, drop_extra_dirs=False, no_reindex=False)) + cmd_reset( + argparse.Namespace( + root=str(root), yes=True, drop_extra_dirs=False, no_reindex=False + ) + ) else: console.print(" [dim]Run 'cgh index' when ready.[/dim]") else: - console.print(" [dim]No parseable files found. Run 'codegraph parsers' to see supported languages.[/dim]") + console.print( + " [dim]No parseable files found. 
Run 'codegraph parsers' to see supported languages.[/dim]" + ) # -- Done -- - console.print() - console.print( - Panel( - "[bold]codegraph is ready![/bold]\n\n" - " [cyan]cgh stats[/cyan] View graph statistics\n" - " [cyan]cgh search X[/cyan] Find symbols\n" - " [cyan]cgh serve[/cyan] Start MCP server\n" - " [cyan]cgh parsers[/cyan] List supported languages\n" - " [cyan]cgh --help[/cyan] All commands", - border_style="green", - ) - ) + _print_init_summary() def _claude_hook_specs(cli_prefix: str) -> list[dict]: @@ -1003,7 +1121,9 @@ def _append_hook(settings: dict, spec: dict) -> None: bucket.append({"matcher": spec["matcher"], "hooks": [entry]}) -def _ensure_claude_hooks(settings_shared: dict, settings_local: dict, cli_prefix: str) -> dict: +def _ensure_claude_hooks( + settings_shared: dict, settings_local: dict, cli_prefix: str +) -> dict: """ Idempotently route each cgh hook to the right settings file. @@ -1061,7 +1181,13 @@ def _ensure_claude_hooks(settings_shared: dict, settings_local: dict, cli_prefix # "shared" → .claude/settings.json # "local" → .claude/settings.local.json _CLAUDE_HOOK_MARKERS = [ - ("cgh-reindex-on-commit", "PostToolUse", "Bash(git commit*)", "post-commit reindex", "shared"), + ( + "cgh-reindex-on-commit", + "PostToolUse", + "Bash(git commit*)", + "post-commit reindex", + "shared", + ), ("cgh-precheck-grep", "PreToolUse", "Grep", "pre-Grep symbol hint", "local"), ("cgh-precheck-read", "PreToolUse", "Read", "pre-Read outline hint", "local"), ] @@ -1086,7 +1212,10 @@ def audit_claude_integration(root: Path) -> dict: """ import json as _json - from codegraph.integrations.skill_installer import _iter_skills, detect_modified_skills + from codegraph.integrations.skill_installer import ( + _iter_skills, + detect_modified_skills, + ) report: dict = {} @@ -1188,7 +1317,15 @@ def _load_settings(path: Path) -> dict: report["overall"] = ( "ok" - if all(s["status"] == "ok" for s in (report["mcp_json"], report["hooks"], report["skills"], 
report["usage_block"])) + if all( + s["status"] == "ok" + for s in ( + report["mcp_json"], + report["hooks"], + report["skills"], + report["usage_block"], + ) + ) else "drift" ) return report @@ -1227,7 +1364,9 @@ def _skills_line(tool_label: str, names: list[str]) -> None: if names: plural = "s" if len(names) != 1 else "" joined = ", ".join(names) - console.print(f" [green]+[/green] {tool_label} [dim]({len(names)} skill{plural}: {joined})[/dim]") + console.print( + f" [green]+[/green] {tool_label} [dim]({len(names)} skill{plural}: {joined})[/dim]" + ) if tool == "claude": mcp_path = root / ".mcp.json" @@ -1248,8 +1387,16 @@ def _skills_line(tool_label: str, names: list[str]) -> None: shared_path = settings_dir / "settings.json" local_path = settings_dir / "settings.local.json" - shared = _json.loads(shared_path.read_text(encoding="utf-8")) if shared_path.exists() else {} - local = _json.loads(local_path.read_text(encoding="utf-8")) if local_path.exists() else {} + shared = ( + _json.loads(shared_path.read_text(encoding="utf-8")) + if shared_path.exists() + else {} + ) + local = ( + _json.loads(local_path.read_text(encoding="utf-8")) + if local_path.exists() + else {} + ) cli = mcp_entry["command"] # cgh / codegraph / python -m codegraph cli_prefix = cli if cli != sys.executable else f"{sys.executable} -m codegraph" @@ -1257,24 +1404,36 @@ def _skills_line(tool_label: str, names: list[str]) -> None: result = _ensure_claude_hooks(shared, local, cli_prefix) if result["shared_changed"]: - shared_path.write_text(_json.dumps(shared, indent=2) + "\n", encoding="utf-8") + shared_path.write_text( + _json.dumps(shared, indent=2) + "\n", encoding="utf-8" + ) if result["local_changed"]: local_path.write_text(_json.dumps(local, indent=2) + "\n", encoding="utf-8") for label in result["added"]: target_file = next( - (s["target"] for s in _claude_hook_specs(cli_prefix) if s["label"] == label), + ( + s["target"] + for s in _claude_hook_specs(cli_prefix) + if s["label"] == label + 
), "shared", ) - target_name = "settings.json" if target_file == "shared" else "settings.local.json" - console.print(f" [green]+[/green] .claude/{target_name} [dim]({label})[/dim]") + target_name = ( + "settings.json" if target_file == "shared" else "settings.local.json" + ) + console.print( + f" [green]+[/green] .claude/{target_name} [dim]({label})[/dim]" + ) for label in result["moved"]: console.print( f" [yellow]~[/yellow] {label} [dim](moved to the correct settings file)[/dim]" ) # Skills, may preserve local edits if the user said so - _skills_line(".claude/skills/", install_claude(root, overwrite_modified=overwrite_skills)) + _skills_line( + ".claude/skills/", install_claude(root, overwrite_modified=overwrite_skills) + ) elif tool == "cursor": cursor_dir = root / ".cursor" @@ -1293,7 +1452,9 @@ def _skills_line(tool_label: str, names: list[str]) -> None: data = {"mcpServers": {}} data.setdefault("mcpServers", {})["codegraph"] = mcp_entry mcp_path.write_text(_json.dumps(data, indent=2) + "\n", encoding="utf-8") - console.print(" [green]+[/green] .mcp.json [dim](MCP server for Codex)[/dim]") + console.print( + " [green]+[/green] .mcp.json [dim](MCP server for Codex)[/dim]" + ) _skills_line("AGENTS.md", install_codex(root)) elif tool == "gemini": @@ -1304,7 +1465,9 @@ def _skills_line(tool_label: str, names: list[str]) -> None: data = {"mcpServers": {}} data.setdefault("mcpServers", {})["codegraph"] = mcp_entry mcp_path.write_text(_json.dumps(data, indent=2) + "\n", encoding="utf-8") - console.print(" [green]+[/green] .mcp.json [dim](MCP server for Gemini)[/dim]") + console.print( + " [green]+[/green] .mcp.json [dim](MCP server for Gemini)[/dim]" + ) _skills_line("GEMINI.md", install_gemini(root)) @@ -1351,8 +1514,12 @@ def cmd_parsers(args) -> None: ) console.print(table) - console.print(f"\n[dim]Total: {len(get_supported_extensions())} file extensions supported[/dim]") - console.print("\n[dim]To add a new parser: create a file in codegraph/parsers/[/dim]") + 
console.print( + f"\n[dim]Total: {len(get_supported_extensions())} file extensions supported[/dim]" + ) + console.print( + "\n[dim]To add a new parser: create a file in codegraph/parsers/[/dim]" + ) console.print("[dim]with @register_parser('.ext') and subclass BaseParser.[/dim]") diff --git a/codegraph/cli/commands_migrate.py b/codegraph/cli/commands_migrate.py index 1f5d09a..e75cb4b 100644 --- a/codegraph/cli/commands_migrate.py +++ b/codegraph/cli/commands_migrate.py @@ -11,6 +11,8 @@ from __future__ import annotations +import argparse + import os from dataclasses import dataclass from pathlib import Path @@ -97,7 +99,9 @@ def _diff_stats(kuzu: dict, duckdb: dict) -> list[tuple[str, int, int]]: differs between the two snapshots.""" diffs: list[tuple[str, int, int]] = [] for kind in ("nodes", "edges"): - all_keys = sorted(set(kuzu.get(kind, {}).keys()) | set(duckdb.get(kind, {}).keys())) + all_keys = sorted( + set(kuzu.get(kind, {}).keys()) | set(duckdb.get(kind, {}).keys()) + ) for key in all_keys: k = kuzu.get(kind, {}).get(key, 0) d = duckdb.get(kind, {}).get(key, 0) @@ -111,9 +115,11 @@ def _diff_stats(kuzu: dict, duckdb: dict) -> list[tuple[str, int, int]]: # we see is one of these, the diff is almost certainly stale-Kuzu and # not a real divergence. Order matches the graphify-inspired PRs that # introduced or fixed each edge type. -_POST_FIX_GAIN_METRICS = frozenset({ - "edges.IMPORTS", # IMPORTS edges latent bug, pre-fix Kuzu has 0 -}) +_POST_FIX_GAIN_METRICS = frozenset( + { + "edges.IMPORTS", # IMPORTS edges latent bug, pre-fix Kuzu has 0 + } +) def _classify_diff( @@ -265,7 +271,7 @@ def do_migrate_to_duckdb( ) -def cmd_migrate_to_duckdb(args) -> None: +def cmd_migrate_to_duckdb(args: argparse.Namespace) -> None: """CLI wrapper, Rich rendering on top of ``do_migrate_to_duckdb``. Handles the interactive "delete graph.db?" 
prompt that's specific to @@ -311,7 +317,9 @@ def cmd_migrate_to_duckdb(args) -> None: console.print("[bold]Step 1[/bold] · Reading current Kuzu graph counts...") console.print("[bold]Step 2[/bold] · Re-indexing into DuckDB...") - console.print("[bold]Step 3[/bold] · Verifying DuckDB graph against Kuzu baseline...\n") + console.print( + "[bold]Step 3[/bold] · Verifying DuckDB graph against Kuzu baseline...\n" + ) result = do_migrate_to_duckdb( args.root, delete_kuzu=delete_via_function, force=args.force @@ -327,7 +335,9 @@ def cmd_migrate_to_duckdb(args) -> None: ) if result.status == "mismatched": - diff_table = Table(box=box.SIMPLE_HEAD, title="Differing rows", title_style="bold yellow") + diff_table = Table( + box=box.SIMPLE_HEAD, title="Differing rows", title_style="bold yellow" + ) diff_table.add_column("metric", style="bold") diff_table.add_column("kuzu", justify="right") diff_table.add_column("duckdb", justify="right") @@ -347,7 +357,9 @@ def cmd_migrate_to_duckdb(args) -> None: raise SystemExit(1) if result.status == "stale_kuzu": - diff_table = Table(box=box.SIMPLE_HEAD, title="Differing rows", title_style="bold cyan") + diff_table = Table( + box=box.SIMPLE_HEAD, title="Differing rows", title_style="bold cyan" + ) diff_table.add_column("metric", style="bold") diff_table.add_column("kuzu", justify="right") diff_table.add_column("duckdb", justify="right") @@ -368,10 +380,14 @@ def cmd_migrate_to_duckdb(args) -> None: else: console.print(" [green]+[/green] node + edge counts match exactly.\n") - kuzu_size = _size_str(kuzu_path.stat().st_size) if kuzu_path.exists() else "(deleted)" + kuzu_size = ( + _size_str(kuzu_path.stat().st_size) if kuzu_path.exists() else "(deleted)" + ) duckdb_size = _size_str(duckdb_path.stat().st_size) - summary = Table(box=box.SIMPLE_HEAD, title="Migration summary", title_style="bold cyan") + summary = Table( + box=box.SIMPLE_HEAD, title="Migration summary", title_style="bold cyan" + ) summary.add_column("backend", style="bold") 
summary.add_column("file", overflow="fold") summary.add_column("size", justify="right") @@ -389,9 +405,13 @@ def cmd_migrate_to_duckdb(args) -> None: if kuzu_path.exists() and not args.yes: try: - answer = console.input( - f"Delete the old [bold]graph.db[/bold] ({kuzu_size})? [Y/n] " - ).strip().lower() + answer = ( + console.input( + f"Delete the old [bold]graph.db[/bold] ({kuzu_size})? [Y/n] " + ) + .strip() + .lower() + ) except EOFError: answer = "n" if answer in ("n", "no"): diff --git a/codegraph/cli/commands_monitor.py b/codegraph/cli/commands_monitor.py index c476d45..3a02411 100644 --- a/codegraph/cli/commands_monitor.py +++ b/codegraph/cli/commands_monitor.py @@ -8,6 +8,7 @@ from __future__ import annotations +import argparse import json import os import re @@ -37,7 +38,7 @@ def _build_stats_group(root: str) -> Group: return _stats_content(root) -def cmd_stats(args) -> None: +def cmd_stats(args: argparse.Namespace) -> None: root = os.path.abspath(args.root) if getattr(args, "json", False): @@ -142,7 +143,9 @@ def _stats_content(root: str) -> Group: curr_branch = ss.get("current_branch") or "?" 
drift_bits = [] if behind: - drift_bits.append(f"{behind} commit{'s' if behind != 1 else ''} behind") + drift_bits.append( + f"{behind} commit{'s' if behind != 1 else ''} behind" + ) if dirty: drift_bits.append("working tree dirty") if branch != curr_branch: @@ -208,7 +211,9 @@ def _stats_content(root: str) -> Group: for edge, count in sorted(edges.items(), key=lambda x: -x[1]): edge_table.add_row(edge, f"{count:,}") edge_table.add_section() - edge_table.add_row("[bold]Total[/bold]", f"[bold]{sum(edges.values()):,}[/bold]") + edge_table.add_row( + "[bold]Total[/bold]", f"[bold]{sum(edges.values()):,}[/bold]" + ) renderables.append(edge_table) info_table = Table(box=box.SIMPLE_HEAD, title="Index Info", title_style="bold cyan") @@ -223,20 +228,31 @@ def _stats_content(root: str) -> Group: if total_size > 0: info_table.add_section() if total_size > 1024 * 1024: - info_table.add_row("[bold]Total storage[/bold]", f"[bold]{total_size / 1024 / 1024:.1f} MB[/bold]") + info_table.add_row( + "[bold]Total storage[/bold]", + f"[bold]{total_size / 1024 / 1024:.1f} MB[/bold]", + ) else: - info_table.add_row("[bold]Total storage[/bold]", f"[bold]{total_size / 1024:.0f} KB[/bold]") + info_table.add_row( + "[bold]Total storage[/bold]", f"[bold]{total_size / 1024:.0f} KB[/bold]" + ) renderables.append(info_table) if call_stats["total_calls"] > 0: - call_table = Table(title="MCP Tool Calls", box=box.SIMPLE_HEAD, title_style="bold cyan") + call_table = Table( + title="MCP Tool Calls", box=box.SIMPLE_HEAD, title_style="bold cyan" + ) call_table.add_column("Tool", style="bold") call_table.add_column("Calls", justify="right") call_table.add_column("Avg ms", justify="right") call_table.add_column("Max ms", justify="right") call_table.add_column("Errors", justify="right") - for tool, ts in sorted(call_stats.get("tools", {}).items(), key=lambda x: -x[1]["calls"]): - err_str = f"[red]{ts['errors']}[/red]" if ts["errors"] > 0 else "[dim]0[/dim]" + for tool, ts in sorted( + 
call_stats.get("tools", {}).items(), key=lambda x: -x[1]["calls"] + ): + err_str = ( + f"[red]{ts['errors']}[/red]" if ts["errors"] > 0 else "[dim]0[/dim]" + ) call_table.add_row( tool, str(ts["calls"]), @@ -254,7 +270,9 @@ def _stats_content(root: str) -> Group: ) renderables.append(call_table) else: - renderables.append(Text.from_markup("[dim]MCP tool calls: 0 (no calls logged yet)[/dim]")) + renderables.append( + Text.from_markup("[dim]MCP tool calls: 0 (no calls logged yet)[/dim]") + ) return Group(*renderables) @@ -324,7 +342,88 @@ def _stats_json(root: str) -> str: # --------------------------------------------------------------------------- -def cmd_status(args) -> None: +def _empty_status_source() -> dict: + """The default 'nothing resolved' status-source dict. + + counts_source == "none" is the sentinel cmd_status checks to decide + whether to fall through to the next tier. + """ + return { + "file_count": 0, + "endpoint_count": 0, + "counts_source": "none", + "fts_symbols": None, + } + + +def _status_via_owner(root: str, owner_alive: bool, owner_port: int | None) -> dict: + """Tier 1: ask a live owner for counts via the live_graph_stats MCP tool. + + Only attempted when the owner is alive and its port is known. Returns a + status-source dict, counts_source == "owner" on success, else "none". + The owner's live stats do not include an endpoint count, so it stays 0. + """ + src = _empty_status_source() + if not (owner_alive and owner_port): + return src + try: + stats = _ask_owner_live_stats(root, owner_port) + if stats is not None: + nodes = stats.get("nodes") or {} + src["file_count"] = int(nodes.get("File", 0)) + # endpoint count not in live_graph_stats, derive separately if 0 + src["fts_symbols"] = int(stats.get("fts_symbols", 0)) + src["counts_source"] = "owner" + except Exception: + pass + return src + + +def _status_via_ro_open(root: str) -> dict: + """Tier 2: open the graph DB read-only and count File / Endpoint nodes. 
+ + Works only when no owner holds the write lock. Returns a status-source + dict, counts_source == "ro" on success, else "none". + """ + src = _empty_status_source() + try: + from codegraph.core.db import get_readonly_connection + + conn = get_readonly_connection(root) + if conn is not None: + src["file_count"] = conn.count_nodes("File") + src["endpoint_count"] = conn.count_nodes("Endpoint") + src["counts_source"] = "ro" + except Exception: + pass + return src + + +def _status_via_fts(root: str) -> dict: + """Tier 3: count symbols straight from the FTS sqlite (always RO-safe). + + Final fallback when both the owner and a RO graph open are unavailable. + Returns a status-source dict, counts_source == "fts_only" on success, + else "none". + """ + src = _empty_status_source() + try: + import sqlite3 as _sql + + fts_path = Path(root) / ".codegraph" / "fts.db" + if fts_path.exists(): + from codegraph.core.utils import ro_sqlite_uri + + c = _sql.connect(ro_sqlite_uri(fts_path), uri=True) + src["fts_symbols"] = c.execute("SELECT count(*) FROM symbols").fetchone()[0] + c.close() + src["counts_source"] = "fts_only" + except Exception: + pass + return src + + +def cmd_status(args: argparse.Namespace) -> None: """Quick one-screen health check: owner, freshness, counts, extra_dirs.""" import json as _json @@ -342,7 +441,11 @@ def cmd_status(args) -> None: owner_port = None if (Path(root) / ".codegraph").exists(): try: - owner_pid = int((Path(root) / ".codegraph" / "owner.pid").read_text(encoding="utf-8").strip()) + owner_pid = int( + (Path(root) / ".codegraph" / "owner.pid") + .read_text(encoding="utf-8") + .strip() + ) except (OSError, ValueError): pass owner_port = read_owner_port(root) @@ -366,13 +469,17 @@ def cmd_status(args) -> None: ) stats = _ask_owner_incremental_reindex(root, owner_port) if stats is None: - console.print("[yellow]Refresh call failed (timeout or error).[/yellow]\n") + console.print( + "[yellow]Refresh call failed (timeout or error).[/yellow]\n" + ) 
else: rx = stats.get("reindexed_count", stats.get("indexed", 0)) un = stats.get("unchanged_count", stats.get("skipped", 0)) de = stats.get("deleted_count", 0) el = stats.get("elapsed_s", "?") - console.print(f"[green]✓[/green] reindexed={rx}, unchanged={un}, deleted={de}, elapsed={el}s\n") + console.print( + f"[green]✓[/green] reindexed={rx}, unchanged={un}, deleted={de}, elapsed={el}s\n" + ) # Scan freshness (re-read after possible refresh) ss = _scan_status(root) @@ -384,50 +491,20 @@ def cmd_status(args) -> None: # (live_graph_stats), authoritative + cheap. # 2. Else try a local RO open (works only when no owner is up). # 3. As a final fallback, read the FTS sqlite (always RO-safe). - file_count = endpoint_count = 0 - counts_source = "none" - fts_symbols: int | None = None + # Each tier is a helper returning a status-source dict, we take the + # first one that resolved (counts_source != "none"). + src = _status_via_owner(root, owner_alive, owner_port) + if src["counts_source"] == "none": + src = _status_via_ro_open(root) + if src["counts_source"] == "none": + src = _status_via_fts(root) + + file_count = src["file_count"] + endpoint_count = src["endpoint_count"] + counts_source = src["counts_source"] + fts_symbols = src["fts_symbols"] extra_dirs: list[str] = [] - if owner_alive and owner_port: - try: - stats = _ask_owner_live_stats(root, owner_port) - if stats is not None: - nodes = stats.get("nodes") or {} - file_count = int(nodes.get("File", 0)) - # endpoint count not in live_graph_stats, derive separately if 0 - fts_symbols = int(stats.get("fts_symbols", 0)) - counts_source = "owner" - except Exception: - pass - - if counts_source == "none": - try: - from codegraph.core.db import get_readonly_connection - - conn = get_readonly_connection(root) - if conn is not None: - file_count = conn.count_nodes("File") - endpoint_count = conn.count_nodes("Endpoint") - counts_source = "ro" - except Exception: - pass - - if counts_source == "none": - try: - import sqlite3 as 
_sql - - fts_path = Path(root) / ".codegraph" / "fts.db" - if fts_path.exists(): - from codegraph.core.utils import ro_sqlite_uri - - c = _sql.connect(ro_sqlite_uri(fts_path), uri=True) - fts_symbols = c.execute("SELECT count(*) FROM symbols").fetchone()[0] - c.close() - counts_source = "fts_only" - except Exception: - pass - try: import tomllib @@ -442,7 +519,11 @@ def cmd_status(args) -> None: # which children this owner fans queries out to and whether they're up. subrepos: list[dict] = [] try: - from codegraph.analysis.federation import child_owner_status, resolve_children, verify_child + from codegraph.analysis.federation import ( + child_owner_status, + resolve_children, + verify_child, + ) for child in resolve_children(root): st = verify_child(child) @@ -518,17 +599,22 @@ def cmd_status(args) -> None: elif payload["scan"]["indexed_sha"]: drift = [] if ss.get("behind_by"): - drift.append(f"{ss['behind_by']} commit{'s' if ss['behind_by'] != 1 else ''} behind") + drift.append( + f"{ss['behind_by']} commit{'s' if ss['behind_by'] != 1 else ''} behind" + ) if ss.get("dirty"): drift.append("working tree dirty") scan_line = ( f"[yellow]stale[/yellow] indexed [bold]{payload['scan']['indexed_sha']}[/bold] → " - f"HEAD [bold]{payload['scan']['current_sha']}[/bold]" + (f" ({', '.join(drift)})" if drift else "") + f"HEAD [bold]{payload['scan']['current_sha']}[/bold]" + + (f" ({', '.join(drift)})" if drift else "") ) else: scan_line = "[dim]no scan recorded, run cgh index[/dim]" - table = Table(box=box.SIMPLE_HEAD, title="codegraph status", title_style="bold cyan") + table = Table( + box=box.SIMPLE_HEAD, title="codegraph status", title_style="bold cyan" + ) table.add_column("", style="bold") table.add_column("", overflow="fold") table.add_row("Version", f"[cyan]{VERSION}[/cyan]") @@ -544,13 +630,15 @@ def cmd_status(args) -> None: endpoints_cell = f"{endpoint_count:,}" elif counts_source == "fts_only": files_cell = f"[dim]graph locked[/dim]{fts_suffix}" - endpoints_cell = 
"[dim], [/dim]" + endpoints_cell = "[dim]unknown (graph locked)[/dim]" else: files_cell = "[dim]unknown[/dim]" - endpoints_cell = "[dim], [/dim]" + endpoints_cell = "[dim]unknown[/dim]" table.add_row("Files", files_cell) table.add_row("Endpoints", endpoints_cell) - table.add_row("Extra dirs", ", ".join(extra_dirs) if extra_dirs else "[dim]none[/dim]") + table.add_row( + "Extra dirs", ", ".join(extra_dirs) if extra_dirs else "[dim]none[/dim]" + ) table.add_row("Subrepos", _format_subrepos_cell(subrepos)) console.print(table) @@ -624,7 +712,7 @@ def _backend_status_line(root: str) -> str: f"[dim]none on disk[/dim] " f"[dim](CGH_DB={env_value!r}, next `cgh index` writes a {env_backend} DB)[/dim]" ) - return "[dim]none on disk[/dim] [dim](would create graph.db)[/dim]" + return "[dim]none on disk[/dim] [dim](would create graph.duckdb)[/dim]" def _size(p: Path) -> str: size = p.stat().st_size @@ -637,7 +725,10 @@ def _size(p: Path) -> str: colour = "green" if kind == "duckdb" else "magenta" label = f"[{colour}]{kind}[/{colour}] ([dim]{path.name}, {_size(path)}[/dim])" if env_was_set and env_backend != kind: - return label + f" [yellow]CGH_DB={env_value!r} mismatch, next index would create a {env_backend} DB[/yellow]" + return ( + label + + f" [yellow]CGH_DB={env_value!r} mismatch, next index would create a {env_backend} DB[/yellow]" + ) if kind == "kuzu": # Gentle nudge toward DuckDB, about 18x faster + 5x smaller # on the wb-backend stress test. Opt-in, not automatic. 
@@ -783,9 +874,17 @@ def _print_workers_table(worker_pids: list[int], owner_pid: int | None) -> None: # One ps call, many PIDs → fewer subprocess invocations try: r = subprocess.run( - ["ps", "-o", "pid=,tty=,lstart=,command=", "-p", ",".join(str(p) for p in pids)], + [ + "ps", + "-o", + "pid=,tty=,lstart=,command=", + "-p", + ",".join(str(p) for p in pids), + ], capture_output=True, - text=True, encoding="utf-8", errors="replace", + text=True, + encoding="utf-8", + errors="replace", timeout=3, ) lines = [ln.rstrip() for ln in r.stdout.splitlines() if ln.strip()] @@ -835,7 +934,7 @@ def _print_workers_table(worker_pids: list[int], owner_pid: int | None) -> None: # --------------------------------------------------------------------------- -def cmd_reset(args) -> None: +def cmd_reset(args: argparse.Namespace) -> None: """ Nuke the graph + FTS DBs, kill the owner, then optionally re-index and re-publish. Use after a schema migration or when the graph gets @@ -887,7 +986,9 @@ def cmd_reset(args) -> None: targets.append(p) # Kuzu also writes .wal / .tmp / shm files for p in cg_dir.iterdir(): - if p.is_file() and (p.name.startswith("graph.db") or p.name.startswith("fts.db")): + if p.is_file() and ( + p.name.startswith("graph.db") or p.name.startswith("fts.db") + ): if p not in targets: targets.append(p) # Workers dir + port + pid files (leftovers) @@ -960,7 +1061,7 @@ def _pid_alive(pid: int) -> bool: # --------------------------------------------------------------------------- -def cmd_tail(args) -> None: +def cmd_tail(args: argparse.Namespace) -> None: """Live view of scan/watcher activity. 
Works while MCP server is running.""" import datetime as _dt import time as _t @@ -1010,7 +1111,9 @@ def _build(): console.print("[dim]Tailing codegraph activity (Ctrl-C to stop)[/dim]\n") try: - with Live(_build(), console=console, refresh_per_second=2, screen=False) as live: + with Live( + _build(), console=console, refresh_per_second=2, screen=False + ) as live: while True: _t.sleep(0.5) live.update(_build()) @@ -1023,7 +1126,7 @@ def _build(): # --------------------------------------------------------------------------- -def cmd_logs(args) -> None: +def cmd_logs(args: argparse.Namespace) -> None: from codegraph.state.call_log import clear_logs, get_logs root = os.path.abspath(args.root) @@ -1063,14 +1166,24 @@ def cmd_logs(args) -> None: table.add_column("Args", max_width=40, overflow="ellipsis") for entry in logs: - status = Text("OK", style="green") if entry["success"] else Text("ERR", style="bold red") + status = ( + Text("OK", style="green") + if entry["success"] + else Text("ERR", style="bold red") + ) try: parsed = json.loads(entry["args"]) args_str = " ".join(f"{k}={v}" for k, v in parsed.items())[:40] except (json.JSONDecodeError, TypeError): args_str = entry["args"][:40] - latency_style = "red" if entry["latency_ms"] > 100 else "yellow" if entry["latency_ms"] > 20 else "green" + latency_style = ( + "red" + if entry["latency_ms"] > 100 + else "yellow" + if entry["latency_ms"] > 20 + else "green" + ) table.add_row( entry["timestamp"], @@ -1089,7 +1202,7 @@ def cmd_logs(args) -> None: # --------------------------------------------------------------------------- -def cmd_history(args) -> None: +def cmd_history(args: argparse.Namespace) -> None: """Show recent indexing activity grouped by day.""" from datetime import datetime, timedelta @@ -1101,7 +1214,9 @@ def cmd_history(args) -> None: console.print(LOGO) if not log_path.exists(): - console.print("[dim]No call log found. 
MCP tools have not been called yet.[/dim]") + console.print( + "[dim]No call log found. MCP tools have not been called yet.[/dim]" + ) return try: @@ -1163,7 +1278,9 @@ def cmd_history(args) -> None: # Grand total grand_total = sum(r[1] for r in rows) grand_errors = sum(r[2] for r in rows) - console.print(f"\n[dim]Total: {grand_total} calls, {grand_errors} errors across {len(rows)} day(s)[/dim]") + console.print( + f"\n[dim]Total: {grand_total} calls, {grand_errors} errors across {len(rows)} day(s)[/dim]" + ) # --------------------------------------------------------------------------- @@ -1171,7 +1288,7 @@ def cmd_history(args) -> None: # --------------------------------------------------------------------------- -def cmd_diff(args) -> None: +def cmd_diff(args: argparse.Namespace) -> None: """Show files changed since last index.""" import subprocess @@ -1191,7 +1308,9 @@ def cmd_diff(args) -> None: result = subprocess.run( ["git", "diff", "--name-only", since], capture_output=True, - text=True, encoding="utf-8", errors="replace", + text=True, + encoding="utf-8", + errors="replace", cwd=root, ) changed_files = [f for f in result.stdout.strip().splitlines() if f] @@ -1204,7 +1323,9 @@ def cmd_diff(args) -> None: result_untracked = subprocess.run( ["git", "ls-files", "--others", "--exclude-standard"], capture_output=True, - text=True, encoding="utf-8", errors="replace", + text=True, + encoding="utf-8", + errors="replace", cwd=root, ) untracked_files = [f for f in result_untracked.stdout.strip().splitlines() if f] @@ -1222,7 +1343,9 @@ def cmd_diff(args) -> None: other_changed.append(f) # Categorize untracked files - parseable_untracked = [f for f in untracked_files if Path(f).suffix.lower() in supported] + parseable_untracked = [ + f for f in untracked_files if Path(f).suffix.lower() in supported + ] if not changed_files and not parseable_untracked: console.print(f"[dim]No changes since {since}.[/dim]") @@ -1246,7 +1369,9 @@ def cmd_diff(args) -> None: # Show 
non-parseable changed files if other_changed: - console.print(f"\n[dim] + {len(other_changed)} non-parseable changed file(s)[/dim]") + console.print( + f"\n[dim] + {len(other_changed)} non-parseable changed file(s)[/dim]" + ) # Show untracked parseable if parseable_untracked: @@ -1282,7 +1407,7 @@ def cmd_diff(args) -> None: # --------------------------------------------------------------------------- -def cmd_doctor(args) -> None: +def cmd_doctor(args: argparse.Namespace) -> None: """Health check, verify all codegraph components are working.""" import shutil @@ -1296,7 +1421,13 @@ def cmd_doctor(args) -> None: # 1. .codegraph/ exists cg_exists = codegraph_dir.exists() and codegraph_dir.is_dir() - checks.append((".codegraph/ directory", cg_exists, "initialized" if cg_exists else "run 'cgh init' first")) + checks.append( + ( + ".codegraph/ directory", + cg_exists, + "initialized" if cg_exists else "run 'cgh init' first", + ) + ) # 2. graph.db accessible graph_ok = False @@ -1386,7 +1517,13 @@ def cmd_doctor(args) -> None: # 8. .cghignore exists cghignore_path = root / ".cghignore" cghignore_ok = cghignore_path.exists() - checks.append((".cghignore", cghignore_ok, "found" if cghignore_ok else "not found (optional)")) + checks.append( + ( + ".cghignore", + cghignore_ok, + "found" if cghignore_ok else "not found (optional)", + ) + ) # 9. 
MCP server (fastmcp import) mcp_ok = False @@ -1496,7 +1633,11 @@ def _print_claude_audit(audit: dict) -> None: if h.get("misplaced"): bits.append(f"wrong file: {', '.join(h['misplaced'])}") detail = f"[yellow]{h['installed']}/{h['expected']} installed[/yellow] [dim]({'; '.join(bits)})[/dim]" - tbl.add_row(".claude hooks (settings + local)", icon.get(h["status"], "[dim]?[/dim]"), detail) + tbl.add_row( + ".claude hooks (settings + local)", + icon.get(h["status"], "[dim]?[/dim]"), + detail, + ) s = audit["skills"] bits = [f"{s['installed']}/{s['bundled']} installed"] @@ -1504,7 +1645,11 @@ def _print_claude_audit(audit: dict) -> None: bits.append(f"missing: {', '.join(s['missing'])}") if s["modified"]: bits.append(f"modified: {', '.join(s['modified'])}") - detail = ("[green]" if s["status"] == "ok" else "[yellow]") + ", ".join(bits) + ("[/green]" if s["status"] == "ok" else "[/yellow]") + detail = ( + ("[green]" if s["status"] == "ok" else "[yellow]") + + ", ".join(bits) + + ("[/green]" if s["status"] == "ok" else "[/yellow]") + ) tbl.add_row(".claude/skills/", icon.get(s["status"], "[dim]?[/dim]"), detail) u = audit["usage_block"] @@ -1532,7 +1677,7 @@ def _print_claude_audit(audit: dict) -> None: # --------------------------------------------------------------------------- -def cmd_compact(args) -> None: +def cmd_compact(args: argparse.Namespace) -> None: """Vacuum SQLite DBs and show before/after sizes.""" root = os.path.abspath(args.root) codegraph_dir = Path(root) / ".codegraph" @@ -1593,7 +1738,9 @@ def _fmt_size(size_bytes: int) -> str: for db_name, before, after in results: saved = before - after total_saved += saved - saved_str = f"[green]-{_fmt_size(saved)}[/green]" if saved > 0 else "[dim]0[/dim]" + saved_str = ( + f"[green]-{_fmt_size(saved)}[/green]" if saved > 0 else "[dim]0[/dim]" + ) table.add_row(db_name, _fmt_size(before), _fmt_size(after), saved_str) if graph_size > 0: diff --git a/codegraph/cli/commands_query.py b/codegraph/cli/commands_query.py 
index 99a5d4a..b739f74 100644 --- a/codegraph/cli/commands_query.py +++ b/codegraph/cli/commands_query.py @@ -8,6 +8,8 @@ from __future__ import annotations +import argparse + import json import os from pathlib import Path @@ -23,7 +25,7 @@ # --------------------------------------------------------------------------- -def cmd_grep(args) -> None: +def cmd_grep(args: argparse.Namespace) -> None: """Regex/substring pattern search across the indexed repo.""" import json as _json @@ -47,7 +49,9 @@ def cmd_grep(args) -> None: "glob": args.glob or None, "backend": backend, "total": len(hits), - "hits": [{"file": h.file, "line": h.line, "text": h.text} for h in hits], + "hits": [ + {"file": h.file, "line": h.line, "text": h.text} for h in hits + ], }, indent=2, ) @@ -64,7 +68,7 @@ def cmd_grep(args) -> None: console.print(f" [cyan]{short}[/cyan]:[yellow]{h.line}[/yellow] {h.text}") -def cmd_search(args) -> None: +def cmd_search(args: argparse.Namespace) -> None: root = os.path.abspath(args.root) query = args.query limit = args.limit @@ -84,10 +88,16 @@ def cmd_search(args) -> None: for hit in fts_search(fts_conn, query, limit=fetch): results.append((hit.kind, hit.name, hit.file_path, hit.start_line)) except Exception as exc: - console.print(f"[yellow]Graph DB locked and FTS unavailable: {exc}[/yellow]") + console.print( + f"[yellow]Graph DB locked and FTS unavailable: {exc}[/yellow]" + ) return else: - for label, kind in [("Function", "function"), ("Class", "class"), ("MdSection", "md_section")]: + for label, kind in [ + ("Function", "function"), + ("Class", "class"), + ("MdSection", "md_section"), + ]: # Kuzu Cypher requires literal labels, safe: fixed allowlist if label == "MdSection": q = ( @@ -122,16 +132,22 @@ def cmd_search(args) -> None: "returned": len(page), "has_more": has_more, "next_offset": offset + limit if has_more else None, - "results": [{"kind": k, "name": n, "file": fp, "line": ln} for k, n, fp, ln in page], + "results": [ + {"kind": k, "name": n, 
"file": fp, "line": ln} for k, n, fp, ln in page + ], } print(json.dumps(out, indent=2)) return if not page: if offset > 0: - console.print(f"[dim]No more results for '[/dim][bold]{query}[/bold][dim]' at offset {offset}[/dim]") + console.print( + f"[dim]No more results for '[/dim][bold]{query}[/bold][dim]' at offset {offset}[/dim]" + ) else: - console.print(f"[dim]No symbols matching '[/dim][bold]{query}[/bold][dim]'[/dim]") + console.print( + f"[dim]No symbols matching '[/dim][bold]{query}[/bold][dim]'[/dim]" + ) return table = Table(box=box.SIMPLE_HEAD, title=f"Search: {query}", title_style="bold") @@ -139,7 +155,11 @@ def cmd_search(args) -> None: table.add_column("Symbol", style="bold") table.add_column("Location", style="dim") - icons = {"function": "[green]fn[/green]", "class": "[yellow]cls[/yellow]", "md_section": "[cyan]doc[/cyan]"} + icons = { + "function": "[green]fn[/green]", + "class": "[yellow]cls[/yellow]", + "md_section": "[cyan]doc[/cyan]", + } for kind, name, fp, line in page: short = _short_path(fp, root) table.add_row(icons.get(kind, kind), name, f"{short}:{line}") @@ -161,7 +181,7 @@ def cmd_search(args) -> None: # --------------------------------------------------------------------------- -def cmd_lookup(args) -> None: +def cmd_lookup(args: argparse.Namespace) -> None: root = os.path.abspath(args.root) name = args.name found = False @@ -189,7 +209,9 @@ def cmd_lookup(args) -> None: f" {icon} [bold]{hit.name}[/bold] [dim]{short}:{hit.start_line}-{hit.end_line}[/dim]" ) except Exception as exc: - console.print(f"[yellow]Graph DB locked and FTS unavailable: {exc}[/yellow]") + console.print( + f"[yellow]Graph DB locked and FTS unavailable: {exc}[/yellow]" + ) return else: for label, kind in [ @@ -204,7 +226,11 @@ def cmd_lookup(args) -> None: "RETURN n.title AS name, n.file_path, n.start_line, n.end_line" ) else: - q = "MATCH (n:" + label + ") WHERE n.name = $q RETURN n.name, n.file_path, n.start_line, n.end_line" + q = ( + "MATCH (n:" + + label + 
+ ") WHERE n.name = $q RETURN n.name, n.file_path, n.start_line, n.end_line" + ) r = conn.execute(q, {"q": name}) for row in _rows(r): found = True @@ -214,10 +240,14 @@ def cmd_lookup(args) -> None: el = row.get("n.end_line", row.get("end_line", "?")) icon = icons.get(kind, kind) short = _short_path(fp, root) - console.print(f" {icon} [bold]{n}[/bold] [dim]{short}:{sl}-{el}[/dim]") + console.print( + f" {icon} [bold]{n}[/bold] [dim]{short}:{sl}-{el}[/dim]" + ) if not found: - console.print(f"[dim]No symbol found matching '[/dim][bold]{name}[/bold][dim]'[/dim]") + console.print( + f"[dim]No symbol found matching '[/dim][bold]{name}[/bold][dim]'[/dim]" + ) # --------------------------------------------------------------------------- @@ -225,11 +255,13 @@ def cmd_lookup(args) -> None: # --------------------------------------------------------------------------- -def cmd_callers(args) -> None: +def cmd_callers(args: argparse.Namespace) -> None: root = os.path.abspath(args.root) conn = _get_conn(root, readonly=True) if conn is None: - console.print("[yellow]Graph DB is locked (indexing?). Try again later.[/yellow]") + console.print( + "[yellow]Graph DB is locked (indexing?). 
Try again later.[/yellow]" + ) return r = conn.execute( "MATCH (caller:Function)-[:CALLS]->(callee:Function) " @@ -239,13 +271,17 @@ def cmd_callers(args) -> None: ) rows = _rows(r) if not rows: - console.print(f"[dim]No callers of '[/dim][bold]{args.fn_name}[/bold][dim]' found[/dim]") + console.print( + f"[dim]No callers of '[/dim][bold]{args.fn_name}[/bold][dim]' found[/dim]" + ) return tree = Tree(f"[bold yellow]{args.fn_name}[/bold yellow] [dim]is called by:[/dim]") for row in rows: short = _short_path(row["caller.file_path"], root) - tree.add(f"[green]{row['caller.name']}[/green] [dim]{short}:{row['caller.start_line']}[/dim]") + tree.add( + f"[green]{row['caller.name']}[/green] [dim]{short}:{row['caller.start_line']}[/dim]" + ) console.print(tree) @@ -254,11 +290,13 @@ def cmd_callers(args) -> None: # --------------------------------------------------------------------------- -def cmd_callees(args) -> None: +def cmd_callees(args: argparse.Namespace) -> None: root = os.path.abspath(args.root) conn = _get_conn(root, readonly=True) if conn is None: - console.print("[yellow]Graph DB is locked (indexing?). Try again later.[/yellow]") + console.print( + "[yellow]Graph DB is locked (indexing?). 
Try again later.[/yellow]" + ) return r = conn.execute( "MATCH (caller:Function)-[:CALLS]->(callee:Function) " @@ -274,7 +312,9 @@ def cmd_callees(args) -> None: tree = Tree(f"[bold green]{args.fn_name}[/bold green] [dim]calls:[/dim]") for row in rows: short = _short_path(row["callee.file_path"], root) - tree.add(f"[yellow]{row['callee.name']}[/yellow] [dim]{short}:{row['callee.start_line']}[/dim]") + tree.add( + f"[yellow]{row['callee.name']}[/yellow] [dim]{short}:{row['callee.start_line']}[/dim]" + ) console.print(tree) @@ -283,11 +323,13 @@ def cmd_callees(args) -> None: # --------------------------------------------------------------------------- -def cmd_outline(args) -> None: +def cmd_outline(args: argparse.Namespace) -> None: root = os.path.abspath(args.root) conn = _get_conn(root, readonly=True) if conn is None: - console.print("[yellow]Graph DB is locked (indexing?). Try again later.[/yellow]") + console.print( + "[yellow]Graph DB is locked (indexing?). Try again later.[/yellow]" + ) return file_path = args.file if not os.path.isabs(file_path): @@ -327,7 +369,14 @@ def cmd_outline(args) -> None: node_stack.pop() parent = node_stack[-1][1] - level_colors = {1: "bold cyan", 2: "green", 3: "yellow", 4: "dim", 5: "dim", 6: "dim"} + level_colors = { + 1: "bold cyan", + 2: "green", + 3: "yellow", + 4: "dim", + 5: "dim", + 6: "dim", + } style = level_colors.get(level, "dim") child = parent.add(f"[{style}]{title}[/{style}] [dim]L{line}[/dim]") diff --git a/codegraph/core/config.py b/codegraph/core/config.py index 887bf9e..eaa3b38 100644 --- a/codegraph/core/config.py +++ b/codegraph/core/config.py @@ -47,6 +47,21 @@ CLAUDE_HOME = Path.home() / ".claude" +def find_codegraph_root(start: "str | Path") -> "Path | None": + """Walk up from ``start`` to the nearest ancestor that has a .codegraph/ + directory, the way git finds its repo root via .git. Returns that + directory, or None if none is found up to the filesystem root. 
+ + This lets every read command work from a subdirectory of an initialized + repo: a file deep in the tree still knows it belongs to the cgh root. + """ + p = Path(start).resolve() + for d in [p, *p.parents]: + if (d / CODEGRAPH_DIR).is_dir(): + return d + return None + + def _claude_project_slug_from_abs(abs_path: str) -> str: """Slug Claude Code uses for ~/.claude/projects//. @@ -159,6 +174,12 @@ class CodegraphConfig: max_file_size_kb: int = 500 # Dirs to force-index even if gitignored (relative to project_root or absolute). include_dirs: list[str] = field(default_factory=list) + # Opt-in precise CALLS resolution for Python via jedi (proof of concept, + # see codegraph/analysis/precise_calls.py). Off by default: when False, or + # when the optional `jedi` extra is not installed, the indexer keeps using + # the name-matched resolver and behavior is unchanged. Enable with this + # flag in config.toml or the CGH_PRECISE_CALLS env var. + precise_calls: bool = False # Parsers enabled_parsers: list[str] | None = None # None = all available @@ -231,6 +252,9 @@ def load_config(project_root: str | Path | None = None) -> CodegraphConfig: if os.environ.get("CODEGRAPH_RUFLO_ENABLED"): config.ruflo_enabled = os.environ["CODEGRAPH_RUFLO_ENABLED"].lower() in ("1", "true", "yes") + if os.environ.get("CGH_PRECISE_CALLS"): + config.precise_calls = os.environ["CGH_PRECISE_CALLS"].lower() in ("1", "true", "yes") + return config @@ -245,6 +269,8 @@ def _apply_toml(config: CodegraphConfig, data: dict) -> None: config.max_file_size_kb = cg["max_file_size_kb"] if "include_dirs" in cg: config.include_dirs = list(cg["include_dirs"]) + if "precise_calls" in cg: + config.precise_calls = bool(cg["precise_calls"]) if "log_max_mb" in cg: config.log_max_mb = int(cg["log_max_mb"]) if "log_backup_count" in cg: @@ -289,6 +315,10 @@ def generate_default_config() -> str: # Paths are relative to the project root. 
Use absolute paths for dirs that # live outside the repo (sibling repos prefer add_directory / extra_dirs). # include_dirs = ["docs", "internal/specs"] +# Opt-in precise CALLS resolution for Python (requires `pip install cgh[lsp]`). +# Off by default; uses jedi for goto-definition so cross-file call edges are +# exact instead of name-matched. Env override: CGH_PRECISE_CALLS=1 +# precise_calls = false [parsers] # Uncomment to restrict which parsers are active: @@ -360,6 +390,13 @@ def init_project(root: Path) -> dict: cg_dir.mkdir(parents=True) created.append(str(cg_dir)) + # Restrict the index dir to the owner: auth.key lives here and is the + # whole loopback-auth boundary. No-op on filesystems without POSIX modes. + try: + cg_dir.chmod(0o700) + except OSError: + pass + config_path = cg_dir / CONFIG_FILE if not config_path.exists(): config_path.write_text(generate_default_config(), encoding="utf-8") diff --git a/codegraph/core/db.py b/codegraph/core/db.py index 4a75f4e..416cbf9 100644 --- a/codegraph/core/db.py +++ b/codegraph/core/db.py @@ -264,8 +264,13 @@ def reset_connection() -> None: continue try: obj.close() - except Exception: - pass + except Exception as exc: + # A close that fails on the owner's shutdown path can leave the + # file lock lingering; surface it instead of swallowing silently. + print( + f"[codegraph] warning: failed to close {type(obj).__name__}: {exc}", + file=sys.stderr, + ) _conn = None _db = None _ro_conn = None diff --git a/codegraph/core/db_duckdb.py b/codegraph/core/db_duckdb.py index b0f78d1..b6742ae 100644 --- a/codegraph/core/db_duckdb.py +++ b/codegraph/core/db_duckdb.py @@ -174,7 +174,12 @@ def purge_file_data(self, file_path: str) -> None: f"DELETE FROM {edge.table} WHERE {column} IN ({placeholders})", ids, ) - if edge.dst_label == spec.label and edge.src_label != spec.label: + # Also purge the inbound side. 
For self-referential edges + # (CALLS/INHERITS Function->Function) src and dst share a label + # but use different columns (from_id/to_id), so this removes + # stale callers pointing INTO this file's symbols, matching + # Kuzu's DETACH DELETE. Without it, find_callers keeps ghosts. + if edge.dst_label == spec.label: column = edge.dst_column if column.endswith("_path"): self._conn.execute( diff --git a/codegraph/core/utils.py b/codegraph/core/utils.py index 0297b0e..1413d91 100644 --- a/codegraph/core/utils.py +++ b/codegraph/core/utils.py @@ -8,20 +8,26 @@ from __future__ import annotations +import sys import unicodedata from pathlib import Path def rows(result) -> list[dict]: - """Convert a Kuzu query result to a list of dicts.""" + """Convert a Kuzu query result to a list of dicts. + + Stays resilient (returns whatever rows were read) but no longer fails + silently: an unexpected error here used to masquerade as an empty result + and hide query bugs, so it is logged to stderr. + """ out: list[dict] = [] try: col_names = result.get_column_names() while result.has_next(): row = result.get_next() out.append(dict(zip(col_names, row))) - except Exception: - pass + except Exception as exc: + print(f"[codegraph] warning: rows() failed: {exc}", file=sys.stderr) return out diff --git a/codegraph/indexer.py b/codegraph/indexer.py index 2425a8a..ea69d02 100644 --- a/codegraph/indexer.py +++ b/codegraph/indexer.py @@ -33,14 +33,18 @@ if sys.getrecursionlimit() < _RECURSION_LIMIT: sys.setrecursionlimit(_RECURSION_LIMIT) -_fts_conn = None +# Keyed by resolved repo root: one owner process can touch several repos +# (federation, tests), and a single global conn would return the wrong DB. 
+_fts_conns: dict[str, object] = {} def _get_fts(repo_root): - global _fts_conn - if _fts_conn is None: - _fts_conn = get_fts_conn(repo_root) - return _fts_conn + key = str(Path(repo_root).resolve()) + conn = _fts_conns.get(key) + if conn is None: + conn = get_fts_conn(repo_root) + _fts_conns[key] = conn + return conn _IGNORE_DIRS = { @@ -56,20 +60,29 @@ def _get_fts(repo_root): ".next", } +# Per-import symbol-edge cap: above this a barrel re-export collapses to a +# single whole-module IMPORTS edge instead of one edge per named symbol. +_MAX_IMPORT_SYMBOLS = 50 + _CGHIGNORE_FILE = ".cghignore" -_cghignore_patterns: list[str] | None = None +# Keyed by resolved repo root for the same reason as _fts_conns: a global +# cache would leak one repo's patterns into another in a multi-repo process. +# Patterns are read once per repo per process; editing .cghignore needs a +# restart to take effect. +_cghignore_cache: dict[str, list[str]] = {} def _load_cghignore(repo_root: Path) -> list[str]: - """Load .cghignore patterns (gitignore syntax). Cached after first load.""" - global _cghignore_patterns - if _cghignore_patterns is not None: - return _cghignore_patterns + """Load .cghignore patterns (gitignore syntax). 
Cached per repo root.""" + key = str(Path(repo_root).resolve()) + cached = _cghignore_cache.get(key) + if cached is not None: + return cached ignore_file = repo_root / _CGHIGNORE_FILE if not ignore_file.exists(): - _cghignore_patterns = [] - return _cghignore_patterns + _cghignore_cache[key] = [] + return _cghignore_cache[key] patterns = [] for line in ignore_file.read_text(encoding="utf-8").splitlines(): @@ -81,8 +94,8 @@ def _load_cghignore(repo_root: Path) -> list[str]: continue patterns.append(line) - _cghignore_patterns = patterns - return _cghignore_patterns + _cghignore_cache[key] = patterns + return patterns def _is_cghignored(file_path: Path, repo_root: Path) -> bool: @@ -189,7 +202,7 @@ def _fts_ingest(fts_conn, idx: FileIndex) -> None: docstring=cls.docstring, ) for res in idx.resources: - kind = f"tf_{res.kind}" if res.kind in ("variable", "output") else f"tf_{res.kind}" + kind = f"tf_{res.kind}" upsert_symbol( fts_conn, sym_id=res.id, @@ -226,11 +239,24 @@ def _resolve_calls(conn, functions: list, lang: str = "") -> None: """ from codegraph.parsers.builtins import is_builtin + # Memoize name -> candidate ids for this file's resolution pass: many + # functions in a file call the same names, and the Function node set is + # stable while we only add edges. + name_cache: dict[str, list[str]] = {} for fn in functions: + same_file_prefix = f"{fn.file_path}::" for called_name in fn.calls: if lang and is_builtin(lang, called_name): continue - for callee_id in conn.find_node_keys("Function", "name", called_name): + candidates = name_cache.get(called_name) + if candidates is None: + candidates = list(conn.find_node_keys("Function", "name", called_name)) + name_cache[called_name] = candidates + # Prefer a definition in the same file. A local call like run() + # almost never means "every function named run in the repo"; only + # fan out across files when there is no same-file match. 
+ same_file = [c for c in candidates if c.startswith(same_file_prefix)] + for callee_id in same_file or candidates: conn.ensure_edge("CALLS", fn.id, callee_id) @@ -242,7 +268,43 @@ def _resolve_inherits(conn, classes: list) -> None: conn.ensure_edge("INHERITS", cls.id, parent_id) -def _ingest_code(conn, idx: FileIndex) -> None: +def _precise_calls_enabled(cfg, lang: str) -> bool: + """True only when the user opted in AND jedi is importable AND this is a + Python file. Any of these missing keeps the name-matched resolver, so the + default install behaves exactly as before. + """ + if cfg is None or not getattr(cfg, "precise_calls", False): + return False + if lang != "python": + return False + from codegraph.analysis.precise_calls import jedi_available + + return jedi_available() + + +def _resolve_calls_precise(conn, idx: FileIndex, repo_root: Path) -> bool: + """Create CALLS edges for one Python file using the jedi-backed resolver. + + Returns True when it ran (even with zero edges), False when it could not + run and the caller should fall back to the name-matched resolver. Never + raises: any error returns False so resolution degrades to the old path. + """ + try: + from codegraph.analysis.precise_calls import resolve_calls_for_file + + edges = resolve_calls_for_file(idx.path, repo_root) + except Exception: + return False + + for caller_id, _target_file, callee_id in edges: + try: + conn.ensure_edge("CALLS", caller_id, callee_id) + except Exception: + continue + return True + + +def _ingest_code(conn, idx: FileIndex, cfg=None, repo_root: Path | None = None) -> None: """Ingest functions, classes, and their edges (Python, TypeScript, Vue, etc.).""" for fn in idx.functions: conn.upsert_node( @@ -279,7 +341,14 @@ def _ingest_code(conn, idx: FileIndex) -> None: class_id = f"{fn.file_path}::{fn.class_name}" conn.ensure_edge("HAS_METHOD", class_id, fn.id) - _resolve_calls(conn, idx.functions, idx.lang) + # Precise CALLS (opt-in, Python only, jedi installed). 
When it runs we + # skip the name-matched resolver for this file so edges aren't doubled. + # Any failure or the flag being off falls straight back to the old path. + used_precise = False + if repo_root is not None and _precise_calls_enabled(cfg, idx.lang): + used_precise = _resolve_calls_precise(conn, idx, repo_root) + if not used_precise: + _resolve_calls(conn, idx.functions, idx.lang) _resolve_inherits(conn, idx.classes) @@ -316,8 +385,12 @@ def _ingest_imports(conn, idx: FileIndex, repo_root: Path | None) -> None: # Symbol annotation on the edge. If the import named multiple # symbols, write one edge per symbol so MCP tools can answer # "who imports name X". Single edge with empty symbol when the - # import is a whole-module pull. + # import is a whole-module pull. A barrel re-export can name + # hundreds of symbols, so collapse past a cap to one whole-module + # edge rather than flooding the graph with per-symbol edges. symbols = imp.symbols if imp.symbols else [""] + if len(symbols) > _MAX_IMPORT_SYMBOLS: + symbols = [""] for sym in symbols: edge_key = f"{target_str}::{sym}" if edge_key in seen_targets: @@ -430,10 +503,11 @@ def _ingest_markdown(conn, idx: FileIndex) -> None: conn.ensure_edge("CONTAINS_SECTION", parent.id, child.id) # Internal links: link markdown sections to files they reference. - # The original Cypher used `WHERE f.path ENDS WITH $tp` which has no - # direct find_node_keys equivalent; for now resolve via find_node_keys - # over all files and filter in Python, small N (file count) makes - # this cheap. + # Markdown links are written relative to the file that contains them, so + # resolve each target against this file's directory before matching the + # (absolute) File node path. This makes ./foo.md and ../api.md resolve, + # where the old raw exact-match on "./foo.md" never did. 
+ md_dir = os.path.dirname(idx.path) for link in idx.links: target = link.target if target.startswith(("http://", "https://", "mailto:", "#")): @@ -441,12 +515,11 @@ def _ingest_markdown(conn, idx: FileIndex) -> None: target_path = target.split("#")[0] if not target_path: continue + resolved_target = os.path.normpath(os.path.join(md_dir, target_path)) section = _find_section_for_line(idx.sections, link.line) if not section: continue - for file_key in conn.find_node_keys("File", "path", target_path): - # Exact match path. We can extend with ENDS WITH later when a - # backend-neutral suffix-match helper exists. + for file_key in conn.find_node_keys("File", "path", resolved_target): conn.ensure_edge( "MD_LINKS_TO", section.id, file_key, {"label": link.label} ) @@ -486,6 +559,7 @@ def index_file( repo_root: str | Path | None = None, force: bool = False, git_blob_sha: str | None = None, + cfg=None, ) -> bool: """ Parse and ingest a single file into the graph. @@ -496,6 +570,9 @@ def index_file( repo_root: Repository root (default: CWD). force: If True, index even if the file is in .gitignore or .git/info/exclude. Skips mtime cache check too, always re-parses. + cfg: Pre-loaded CodegraphConfig. index_repo passes one so the size / + ignore-pattern gate doesn't re-read config.toml per file. When + None (standalone callers) it is loaded once for this call. """ path = Path(path) suffix = path.suffix.lower() @@ -512,6 +589,23 @@ def index_file( if not force and _is_cghignored(path, root): return False + # Respect the configured size cap + ignore_patterns (skip if force). These + # were defined and documented but never enforced, so a huge minified or + # generated file would still be fully read and tree-sitter parsed. 
+ if not force: + import fnmatch as _fnmatch + + from codegraph.core.config import load_config + + _cfg = cfg if cfg is not None else load_config(root) + if any(_fnmatch.fnmatch(path.name, pat) for pat in _cfg.ignore_patterns): + return False + try: + if path.stat().st_size > _cfg.max_file_size_kb * 1024: + return False + except OSError: + pass + conn = get_connection(repo_root) mtime = path.stat().st_mtime @@ -580,9 +674,20 @@ def index_file( module_doc=module_doc, ) + # Resolve the effective config once for ingest. index_repo threads its + # pre-loaded cfg in; standalone / force callers get a fresh load. Only + # consulted for the opt-in precise_calls flag below, so the cost is paid + # only when something actually reads it. + if cfg is not None: + eff_cfg = cfg + else: + from codegraph.core.config import load_config as _load_config_for_ingest + + eff_cfg = _load_config_for_ingest(root) + # Ingest into graph if idx.functions or idx.classes: - _ingest_code(conn, idx) + _ingest_code(conn, idx, cfg=eff_cfg, repo_root=root) if idx.resources: _ingest_terraform(conn, idx) if idx.sections: @@ -696,9 +801,22 @@ def _discover_find(repo_root: Path) -> list[Path]: """Use GNU `find -type f` for file discovery (fast on large repos).""" import subprocess + # Prune the heavy ignore dirs at the walk level so find doesn't descend + # into node_modules/.venv/etc.; the post-hoc _IGNORE_DIRS check below stays + # as a backstop for anything the prune misses. 
+ prune: list[str] = [] + for d in sorted(_IGNORE_DIRS): + prune += ["-name", d, "-o"] + prune = prune[:-1] # drop the trailing -o + cmd = [ + "find", str(repo_root), + "(", "-type", "d", "(", *prune, ")", "-prune", ")", + "-o", "(", "-type", "f", "-not", "-path", "*/.*", "-print", ")", + ] + try: r = subprocess.run( - ["find", str(repo_root), "-type", "f", "-not", "-path", "*/.*"], + cmd, capture_output=True, text=True, encoding="utf-8", errors="replace", timeout=60, @@ -763,7 +881,9 @@ def _discover_git_diff(repo_root: Path) -> tuple[list[Path], list[Path]]: capture_output=True, text=True, encoding="utf-8", errors="replace", cwd=str(repo_root), - timeout=10, + # Match git ls-files (30s): a large rebase diff was timing out at + # 10s and silently falling back to a full scan. + timeout=30, ) changed: list[Path] = [] deleted: list[Path] = [] @@ -880,6 +1000,15 @@ def index_repo( if git_files is None: actual_method = "os_walk" + # Load config once for the whole scan (size cap + ignore patterns), then + # thread it into every index_file call so config.toml is read a single + # time per scan instead of once per file. 
+ from codegraph.core.config import load_config as _load_config + + scan_cfg = _load_config(repo_root) + _size_cap = scan_cfg.max_file_size_kb * 1024 + import fnmatch as _fnmatch + # Filter to parseable, existing files parseable: list[Path] = [] for p in candidates: @@ -892,6 +1021,15 @@ def index_repo( if not p.exists(): stats["skipped"] += 1 continue + if any(_fnmatch.fnmatch(p.name, pat) for pat in scan_cfg.ignore_patterns): + stats["skipped"] += 1 + continue + try: + if p.stat().st_size > _size_cap: + stats["skipped"] += 1 + continue + except OSError: + pass parseable.append(p) if on_discovery: @@ -908,8 +1046,10 @@ def index_repo( conn.delete_file_completely(str(gone)) if fts_conn is not None: delete_file_symbols(fts_conn, str(gone)) - except Exception: - pass + except Exception as exc: + # A failed delete leaves a ghost file in the graph; record it + # rather than silently reporting a clean scan. + _activity_log(repo_root, "scan_error", f"delete {gone}: {exc}") # ------------------------------------------------------------------ # Index loop (shared across all methods) @@ -922,7 +1062,7 @@ def index_repo( sha = blob_shas.get(rel) if sha is None: sha = _git_hash(repo_root, full_path) - ok = index_file(full_path, repo_root, git_blob_sha=sha) + ok = index_file(full_path, repo_root, git_blob_sha=sha, cfg=scan_cfg) status = "indexed" if ok else "error" if ok: stats["indexed"] += 1 diff --git a/codegraph/parsers/__init__.py b/codegraph/parsers/__init__.py index d09cfe8..468a35e 100644 --- a/codegraph/parsers/__init__.py +++ b/codegraph/parsers/__init__.py @@ -128,13 +128,33 @@ def get_parser_for_path(path: str | Path) -> BaseParser | None: # --------------------------------------------------------------------------- +# Parser modules whose tree-sitter grammar is an OPTIONAL extra +# (`pip install cgh[langs]`). When the extra is not installed, importing the +# module raises ImportError/ModuleNotFoundError because its grammar package is +# absent. 
That is expected: we skip the module and the parser simply never +# registers, so cgh keeps working exactly as before. Any OTHER import error +# (a real bug in a hard-dep parser) still propagates. +_OPTIONAL_GRAMMAR_MODULES = frozenset({"csharp", "ruby"}) + + def _discover_parsers(): - """Import all parser modules in this package.""" + """Import all parser modules in this package. + + Optional-grammar modules (see ``_OPTIONAL_GRAMMAR_MODULES``) are skipped + when their grammar package is missing, instead of crashing discovery. + """ package_dir = Path(__file__).parent for _, module_name, _ in pkgutil.iter_modules([str(package_dir)]): if module_name == "base": continue - importlib.import_module(f".{module_name}", package=__package__) + try: + importlib.import_module(f".{module_name}", package=__package__) + except (ImportError, ModuleNotFoundError): + # A parser whose optional grammar package is not installed: skip it. + # Re-raise for non-optional modules so genuine breakage stays loud. + if module_name in _OPTIONAL_GRAMMAR_MODULES: + continue + raise _discover_parsers() @@ -148,5 +168,7 @@ def _discover_parsers(): ["docker-compose.yml", "docker-compose.yaml", "compose.yml", "compose.yaml"], ".yaml", ) -register_by_name([".env.example", ".env.local", ".env.staging", ".env.production"], ".env") +register_by_name( + [".env.example", ".env.local", ".env.staging", ".env.production"], ".env" +) register_by_name(["Makefile", "GNUmakefile"], ".sh") diff --git a/codegraph/parsers/config_data.py b/codegraph/parsers/config_data.py new file mode 100644 index 0000000..65a41a5 --- /dev/null +++ b/codegraph/parsers/config_data.py @@ -0,0 +1,250 @@ +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# __creation__ = 2026-06-07 +# __author__ = "jndjama (Joy Ndjama)" +# __copyright__ = "Copyright 2026 ALTIKVA." 
+# __licence__ = "MIT & CC BY-NC-SA (http://www.altikva.com/licenses/LICENSE-1.0)" +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# Description: Config-as-data parsers for JSON / TOML / YAML. +# Each top-level key (and one nested level) becomes a section in +# the EXISTING MdSection model, so config files are searchable and +# show up in the graph without any new node types. JSON uses the +# stdlib json module, TOML uses tomllib, and YAML uses PyYAML when +# it is importable, falling back to an indentation scan otherwise. +# Parsing never raises: a malformed file yields a partial or empty +# FileIndex built from a best-effort line scan. + +from __future__ import annotations + +import json +import re +import tomllib +from pathlib import Path +from typing import Any + +from . import register_parser +from .base import BaseParser, FileIndex, SectionDef + +# --------------------------------------------------------------------------- +# Shared helpers +# --------------------------------------------------------------------------- + +_MAX_SECTIONS = 500 # guard against pathological configs + + +def _preview(value: Any) -> str: + """Short, single-line summary of a config value for the section body.""" + if isinstance(value, dict): + keys = ", ".join(str(k) for k in list(value)[:10]) + return f"{{{keys}}}" if keys else "{}" + if isinstance(value, list): + items = ", ".join(str(v) for v in value[:8] if not isinstance(v, dict | list)) + return f"[{items}]" if items else f"[{len(value)} items]" + text = str(value) + return re.sub(r"\s+", " ", text)[:200] + + +def _add_section( + idx: FileIndex, + path_str: str, + title: str, + level: int, + line: int, + body: str, +) -> None: + if len(idx.sections) >= _MAX_SECTIONS: + return + sec_id = f"{path_str}::{title}" + if any(s.id == sec_id for s in idx.sections): + sec_id = f"{path_str}::{title}-L{line}" + idx.sections.append( + SectionDef( + id=sec_id, + title=title, + level=level, + 
file_path=path_str, + start_line=line, + end_line=line, + body_preview=body, + anchor=title, + ) + ) + + +def _line_of_key(lines: list[str], key: str) -> int: + """Best-effort source line for a top-level key (1-based, defaults to 1).""" + # JSON/TOML/YAML all write the key near the start of its line. + pat = re.compile(rf"""^\s*['"]?{re.escape(str(key))}['"]?\s*[:=\[]""") + for i, line in enumerate(lines, start=1): + if pat.match(line): + return i + return 1 + + +def _sections_from_mapping( + idx: FileIndex, + path_str: str, + data: dict, + lines: list[str], +) -> None: + """Turn a parsed mapping into sections: every top-level key, plus one + nested level for dict values (e.g. package.json scripts.).""" + for key, value in data.items(): + line = _line_of_key(lines, key) + _add_section(idx, path_str, str(key), 1, line, _preview(value)) + if isinstance(value, dict): + for sub in list(value)[:50]: + _add_section( + idx, + path_str, + f"{key}.{sub}", + 2, + _line_of_key(lines, sub), + _preview(value[sub]), + ) + + +# --------------------------------------------------------------------------- +# JSON +# --------------------------------------------------------------------------- + + +@register_parser(".json", ".jsonc") +class JsonParser(BaseParser): + """JSON / JSONC config files. 
Top-level keys (and one nested level) become + sections, so package.json scripts and tsconfig options stay searchable.""" + + lang = "json" + extensions = [".json", ".jsonc"] + extracts = ["sections"] + description = "JSON / JSONC config files" + + def parse(self, path: Path) -> FileIndex: + path_str = str(path) + idx = FileIndex(path=path_str, lang=self.lang) + try: + text = path.read_text(encoding="utf-8", errors="replace") + except OSError: + return idx + lines = text.splitlines() + try: + data = json.loads(text) + except (ValueError, RecursionError): + return idx # malformed JSON, indexed as a bare File node + if isinstance(data, dict): + _sections_from_mapping(idx, path_str, data, lines) + return idx + + +# --------------------------------------------------------------------------- +# TOML +# --------------------------------------------------------------------------- + + +@register_parser(".toml") +class TomlParser(BaseParser): + """TOML config files (pyproject.toml, config.toml). Top tables and one + nested level become sections.""" + + lang = "toml" + extensions = [".toml"] + extracts = ["sections"] + description = "TOML config files" + + def parse(self, path: Path) -> FileIndex: + path_str = str(path) + idx = FileIndex(path=path_str, lang=self.lang) + try: + text = path.read_text(encoding="utf-8", errors="replace") + except OSError: + return idx + lines = text.splitlines() + try: + data = tomllib.loads(text) + except (tomllib.TOMLDecodeError, ValueError, RecursionError): + # Fall back to a bracket scan so we still surface [table] headers. 
+ for i, line in enumerate(lines, start=1): + m = re.match(r"^\s*\[+([^\]\n]+?)\]*\s*$", line) + if m and m.group(1).strip(): + _add_section(idx, path_str, m.group(1).strip(), 1, i, "") + return idx + if isinstance(data, dict): + _sections_from_mapping(idx, path_str, data, lines) + return idx + + +# --------------------------------------------------------------------------- +# YAML +# --------------------------------------------------------------------------- + +try: # PyYAML is an installed dependency in this environment, but stay soft. + import yaml as _yaml +except ImportError: # pragma: no cover - exercised only without PyYAML + _yaml = None + + +def _yaml_top_keys_scan(idx: FileIndex, path_str: str, lines: list[str]) -> None: + """Indentation-based fallback: column-0 `key:` lines become sections.""" + for i, line in enumerate(lines, start=1): + if line[:1] in ("#", " ", "\t", "") or line.startswith("-"): + continue + m = re.match(r"^([A-Za-z_][\w.\-/]*)\s*:", line) + if m: + _add_section(idx, path_str, m.group(1), 1, i, "") + + +@register_parser(".yaml", ".yml") +class YamlParser(BaseParser): + """YAML config files. Parses with PyYAML when available (top-level keys plus + one nested level: GitHub Actions jobs, compose services, k8s spec keys), + falling back to an indentation scan of column-0 keys when it is not.""" + + lang = "yaml" + extensions = [".yaml", ".yml"] + extracts = ["sections"] + description = "YAML config files" + + def parse(self, path: Path) -> FileIndex: + path_str = str(path) + idx = FileIndex(path=path_str, lang=self.lang) + try: + text = path.read_text(encoding="utf-8", errors="replace") + except OSError: + return idx + lines = text.splitlines() + + if _yaml is None: + _yaml_top_keys_scan(idx, path_str, lines) + return idx + + try: + docs = list(_yaml.safe_load_all(text)) + except Exception: + # Any YAML error: degrade to the line scan, never raise. 
+ _yaml_top_keys_scan(idx, path_str, lines) + return idx + + seen_any = False + for data in docs: + if not isinstance(data, dict): + continue + seen_any = True + # k8s manifests: surface kind/metadata.name as a leading section. + kind = data.get("kind") + name = None + meta = data.get("metadata") + if isinstance(meta, dict): + name = meta.get("name") + if isinstance(kind, str) and isinstance(name, str): + _add_section( + idx, + path_str, + f"{kind}/{name}", + 1, + _line_of_key(lines, "kind"), + _preview(data), + ) + _sections_from_mapping(idx, path_str, data, lines) + + if not seen_any: + _yaml_top_keys_scan(idx, path_str, lines) + return idx diff --git a/codegraph/parsers/csharp.py b/codegraph/parsers/csharp.py new file mode 100644 index 0000000..c4525b3 --- /dev/null +++ b/codegraph/parsers/csharp.py @@ -0,0 +1,190 @@ +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# __creation__ = 2026-06-07 +# __author__ = "jndjama (Joy Ndjama)" +# __copyright__ = "Copyright 2026 ALTIKVA." +# __licence__ = "MIT & CC BY-NC-SA (http://www.altikva.com/licenses/LICENSE-1.0)" +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# Description: C# parser plugin. Extracts classes, interfaces, structs, enums, +# methods, using directives (imports), and call references using +# tree-sitter-c-sharp. Optional: ships behind the `langs` extra, +# so the grammar import only happens when the extra is installed. + +from __future__ import annotations + +import re +from pathlib import Path + +import tree_sitter_c_sharp as tscs +from tree_sitter import Language, Node, Parser + +from . import register_parser +from .base import BaseParser, ClassDef, FileIndex, ImportRef, SymbolDef + +CSHARP_LANGUAGE = Language(tscs.language()) +_parser = Parser(CSHARP_LANGUAGE) + +# Type declarations that map to a ClassDef, with their codegraph kind. 
+_TYPE_DECLS = { + "class_declaration": "class", + "interface_declaration": "interface", + "struct_declaration": "struct", + "enum_declaration": "enum", + "record_declaration": "record", +} + + +def _text(node: Node, src: bytes) -> str: + return src[node.start_byte : node.end_byte].decode("utf-8", errors="replace") + + +def _ident(node: Node, src: bytes) -> str: + from codegraph.core.utils import normalize_identifier + + return normalize_identifier(_text(node, src)) + + +def _collect_calls(node: Node, src: bytes) -> list[str]: + """Walk a C# method body, return called method names (deduped). + + Covers invocation_expression (`obj.Method()`, `Method()`) and + object_creation_expression (`new Foo()`). + """ + calls: list[str] = [] + visited: set[int] = set() + + def walk(n: Node) -> None: + if id(n) in visited: + return + visited.add(id(n)) + if n.type == "invocation_expression": + fn = n.child_by_field_name("function") + if fn: + name = _ident(fn, src) + # `obj.Method` or `Foo.Bar.Method` -> last segment + if "." in name: + name = name.split(".")[-1] + if re.match(r"^\w+$", name, re.UNICODE): + calls.append(name) + elif n.type == "object_creation_expression": + type_node = n.child_by_field_name("type") + if type_node: + name = _ident(type_node, src) + if "." 
in name: + name = name.split(".")[-1] + if re.match(r"^\w+$", name, re.UNICODE): + calls.append(name) + for child in n.children: + walk(child) + + walk(node) + return list(dict.fromkeys(calls)) + + +@register_parser(".cs") +class CSharpParser(BaseParser): + """Tree-sitter parser for C# source files.""" + + lang = "csharp" + extensions = [".cs"] + extracts = ["classes", "interfaces", "methods", "imports", "calls"] + description = "C# source files (.cs)" + tree_sitter_lang = "c_sharp" + + def parse(self, path: Path) -> FileIndex: + path = Path(path) + path_str = str(path) + src = path.read_bytes() + tree = _parser.parse(src) + root = tree.root_node + + index = FileIndex(path=path_str, lang=self.lang) + + def _emit_method(method_node: Node, current_class: str | None) -> None: + name_node = method_node.child_by_field_name("name") + name = _ident(name_node, src) if name_node else "?" + body_node = method_node.child_by_field_name("body") + calls = _collect_calls(body_node, src) if body_node else [] + fn_id = ( + f"{path_str}::{current_class}.{name}" + if current_class + else f"{path_str}::{name}" + ) + index.functions.append( + SymbolDef( + id=fn_id, + name=name, + file_path=path_str, + start_line=method_node.start_point[0] + 1, + end_line=method_node.end_point[0] + 1, + docstring="", + class_name=current_class, + calls=calls, + kind="constructor" + if method_node.type == "constructor_declaration" + else "method", + ) + ) + + def _emit_type(decl: Node, kind: str) -> None: + name_node = decl.child_by_field_name("name") + if not name_node: + return + name = _ident(name_node, src) + bases: list[str] = [] + # `base_list` (`: Base, IFoo`) is a positional child, not a named + # field, so look it up by node type. 
+ for child in decl.children: + if child.type == "base_list": + for b in child.children: + if b.type in ("identifier", "qualified_name", "generic_name"): + bases.append(_ident(b, src)) + break + index.classes.append( + ClassDef( + id=f"{path_str}::{name}", + name=name, + file_path=path_str, + start_line=decl.start_point[0] + 1, + end_line=decl.end_point[0] + 1, + docstring="", + bases=bases, + kind=kind, + ) + ) + body = decl.child_by_field_name("body") + if body: + for child in body.children: + if child.type in ("method_declaration", "constructor_declaration"): + _emit_method(child, name) + elif child.type in _TYPE_DECLS: + _emit_type(child, _TYPE_DECLS[child.type]) + + def _emit_using(decl: Node) -> None: + # `using System;` / `using System.Collections.Generic;` + # Skip the leading `using` keyword and any alias `=`; the module is + # the identifier or qualified_name child. + for child in decl.children: + if child.type in ("identifier", "qualified_name"): + mod = _text(child, src) + if mod: + index.imports.append(ImportRef(source_module=mod, symbols=[])) + return + + def _walk(node: Node) -> None: + # Namespaces (block or file-scoped) just wrap declarations, so + # recurse into them rather than treating them as types. + for child in node.children: + t = child.type + if t == "using_directive": + _emit_using(child) + elif t in _TYPE_DECLS: + _emit_type(child, _TYPE_DECLS[t]) + elif t in ( + "namespace_declaration", + "file_scoped_namespace_declaration", + "declaration_list", + ): + _walk(child) + + _walk(root) + return index diff --git a/codegraph/parsers/plaintext.py b/codegraph/parsers/plaintext.py index 98f823e..686b6c0 100644 --- a/codegraph/parsers/plaintext.py +++ b/codegraph/parsers/plaintext.py @@ -17,7 +17,7 @@ from pathlib import Path from . 
import register_parser -from .base import BaseParser, FileIndex, ResourceDef, SectionDef +from .base import BaseParser, FileIndex, ResourceDef # --------------------------------------------------------------------------- # Dockerfile parser, extracts FROM stages @@ -58,89 +58,13 @@ def parse(self, path: Path) -> FileIndex: # --------------------------------------------------------------------------- -# YAML / TOML, extracts top-level keys as sections -# --------------------------------------------------------------------------- - - -@register_parser(".yaml", ".yml") -class YamlParser(BaseParser): - """YAML config files, extracts top-level keys.""" - - lang = "yaml" - extensions = [".yaml", ".yml"] - extracts = ["sections"] - - def parse(self, path: Path) -> FileIndex: - idx = FileIndex(path=str(path), lang=self.lang) - try: - text = path.read_text(encoding="utf-8", errors="replace") - except OSError: - return idx - for i, line in enumerate(text.splitlines(), 1): - m = re.match(r"^([a-zA-Z_][\w.-]*)\s*:", line) - if m: - key = m.group(1) - idx.sections.append( - SectionDef( - id=f"{path}::{key}", - title=key, - level=1, - file_path=str(path), - start_line=i, - end_line=i, - ) - ) - return idx - - -@register_parser(".toml") -class TomlParser(BaseParser): - """TOML config files, extracts [section] headers.""" - - lang = "toml" - extensions = [".toml"] - extracts = ["sections"] - - def parse(self, path: Path) -> FileIndex: - idx = FileIndex(path=str(path), lang=self.lang) - try: - text = path.read_text(encoding="utf-8", errors="replace") - except OSError: - return idx - for i, line in enumerate(text.splitlines(), 1): - m = re.match(r"^\[([^\]]+)\]", line) - if m: - section = m.group(1) - idx.sections.append( - SectionDef( - id=f"{path}::{section}", - title=section, - level=1, - file_path=str(path), - start_line=i, - end_line=i, - ) - ) - return idx - - -# --------------------------------------------------------------------------- -# JSON / XML, file node only (no 
symbol extraction) +# XML, file node only (no symbol extraction) +# +# YAML / TOML / JSON live in config_data.py and SQL lives in sql.py: they parse +# structured config into sections / resources instead of bare File nodes. # --------------------------------------------------------------------------- -@register_parser(".json", ".jsonc") -class JsonParser(BaseParser): - """JSON/JSONC data files, indexed as File nodes.""" - - lang = "json" - extensions = [".json", ".jsonc"] - extracts = [] - - def parse(self, path: Path) -> FileIndex: - return FileIndex(path=str(path), lang=self.lang) - - @register_parser(".xml", ".xsl", ".xslt", ".svg") class XmlParser(BaseParser): """XML/SVG files, indexed as File nodes.""" @@ -201,15 +125,3 @@ def parse(self, path: Path) -> FileIndex: ) ) return idx - - -@register_parser(".sql") -class SqlParser(BaseParser): - """SQL files, indexed as File nodes.""" - - lang = "sql" - extensions = [".sql"] - extracts = [] - - def parse(self, path: Path) -> FileIndex: - return FileIndex(path=str(path), lang=self.lang) diff --git a/codegraph/parsers/ruby.py b/codegraph/parsers/ruby.py new file mode 100644 index 0000000..43e6638 --- /dev/null +++ b/codegraph/parsers/ruby.py @@ -0,0 +1,175 @@ +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# __creation__ = 2026-06-07 +# __author__ = "jndjama (Joy Ndjama)" +# __copyright__ = "Copyright 2026 ALTIKVA." +# __licence__ = "MIT & CC BY-NC-SA (http://www.altikva.com/licenses/LICENSE-1.0)" +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# Description: Ruby parser plugin. Extracts classes and modules, methods +# (`def` and `def self.`), require / require_relative (imports), +# and call references using tree-sitter-ruby. Optional: ships +# behind the `langs` extra, so the grammar import only happens +# when the extra is installed. 
+ +from __future__ import annotations + +import re +from pathlib import Path + +import tree_sitter_ruby as tsr +from tree_sitter import Language, Node, Parser + +from . import register_parser +from .base import BaseParser, ClassDef, FileIndex, ImportRef, SymbolDef + +RUBY_LANGUAGE = Language(tsr.language()) +_parser = Parser(RUBY_LANGUAGE) + +_REQUIRE_NAMES = {"require", "require_relative", "load", "autoload"} + + +def _text(node: Node, src: bytes) -> str: + return src[node.start_byte : node.end_byte].decode("utf-8", errors="replace") + + +def _ident(node: Node, src: bytes) -> str: + from codegraph.core.utils import normalize_identifier + + return normalize_identifier(_text(node, src)) + + +def _string_value(node: Node, src: bytes) -> str: + """Pull the literal text out of a `string` node, dropping the quotes.""" + for child in node.children: + if child.type == "string_content": + return _text(child, src) + return _text(node, src).strip("\"'") + + +def _collect_calls(node: Node, src: bytes) -> list[str]: + """Walk a Ruby method body, return called method names (deduped). + + A Ruby `call` node is `recv.method(args)` or a bare `method(args)`; + the method name sits in the `method` field (or is the lone identifier + for a paren-less call). 
+ """ + calls: list[str] = [] + visited: set[int] = set() + + def walk(n: Node) -> None: + if id(n) in visited: + return + visited.add(id(n)) + if n.type == "call": + method = n.child_by_field_name("method") + if method is not None: + name = _ident(method, src) + if re.match(r"^\w+[?!]?$", name, re.UNICODE): + calls.append(name.rstrip("?!")) + for child in n.children: + walk(child) + + walk(node) + return list(dict.fromkeys(calls)) + + +@register_parser(".rb") +class RubyParser(BaseParser): + """Tree-sitter parser for Ruby source files.""" + + lang = "ruby" + extensions = [".rb"] + extracts = ["classes", "modules", "methods", "imports", "calls"] + description = "Ruby source files (.rb)" + tree_sitter_lang = "ruby" + + def parse(self, path: Path) -> FileIndex: + path = Path(path) + path_str = str(path) + src = path.read_bytes() + tree = _parser.parse(src) + root = tree.root_node + + index = FileIndex(path=path_str, lang=self.lang) + + def _emit_method(method_node: Node, current_class: str | None) -> None: + name_node = method_node.child_by_field_name("name") + name = _ident(name_node, src) if name_node else "?" 
+ calls = _collect_calls(method_node, src) + is_singleton = method_node.type == "singleton_method" + fn_id = ( + f"{path_str}::{current_class}.{name}" + if current_class + else f"{path_str}::{name}" + ) + index.functions.append( + SymbolDef( + id=fn_id, + name=name, + file_path=path_str, + start_line=method_node.start_point[0] + 1, + end_line=method_node.end_point[0] + 1, + docstring="", + class_name=current_class, + calls=calls, + kind="singleton_method" if is_singleton else "method", + ) + ) + + def _emit_type(decl: Node, kind: str) -> None: + name_node = decl.child_by_field_name("name") + if not name_node: + return + name = _ident(name_node, src) + bases: list[str] = [] + if kind == "class": + superclass = decl.child_by_field_name("superclass") + if superclass: + for child in superclass.children: + if child.type in ("constant", "scope_resolution"): + bases.append(_ident(child, src)) + index.classes.append( + ClassDef( + id=f"{path_str}::{name}", + name=name, + file_path=path_str, + start_line=decl.start_point[0] + 1, + end_line=decl.end_point[0] + 1, + docstring="", + bases=bases, + kind=kind, + ) + ) + body = decl.child_by_field_name("body") + if body: + for child in body.children: + _dispatch(child, name) + + def _emit_require(call_node: Node) -> None: + method = call_node.child_by_field_name("method") + if not method or _ident(method, src) not in _REQUIRE_NAMES: + return + args = call_node.child_by_field_name("arguments") + if not args: + return + for arg in args.children: + if arg.type == "string": + mod = _string_value(arg, src) + if mod: + index.imports.append(ImportRef(source_module=mod, symbols=[])) + return + + def _dispatch(node: Node, current_class: str | None) -> None: + t = node.type + if t == "class": + _emit_type(node, "class") + elif t == "module": + _emit_type(node, "module") + elif t in ("method", "singleton_method"): + _emit_method(node, current_class) + elif t == "call": + _emit_require(node) + + for node in root.children: + _dispatch(node, 
None) + + return index diff --git a/codegraph/parsers/sql.py b/codegraph/parsers/sql.py new file mode 100644 index 0000000..5523c7f --- /dev/null +++ b/codegraph/parsers/sql.py @@ -0,0 +1,182 @@ +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# __creation__ = 2026-06-07 +# __author__ = "jndjama (Joy Ndjama)" +# __copyright__ = "Copyright 2026 ALTIKVA." +# __licence__ = "MIT & CC BY-NC-SA (http://www.altikva.com/licenses/LICENSE-1.0)" +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# Description: SQL / migration DDL parser. Scans CREATE TABLE and +# ALTER TABLE ... ADD COLUMN statements with regex (DDL only, no +# attempt at arbitrary queries) and represents each table as a +# section titled `table:<name>`, listing its columns in the body +# preview. Reuses the existing MdSection model, so no new graph +# node types or schema changes are needed. Parsing never raises. + +from __future__ import annotations + +import re +from pathlib import Path + +from . import register_parser +from .base import BaseParser, FileIndex, SectionDef + +# --------------------------------------------------------------------------- +# Regex patterns (DDL only) +# --------------------------------------------------------------------------- + +# CREATE TABLE [IF NOT EXISTS] [schema.]name ( -- captures the bare table name +_CREATE_TABLE = re.compile( + r"""CREATE\s+(?:TEMP(?:ORARY)?\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)? + (?P<name>[`"\[]?[\w.]+[`"\]]?) + \s*\(""", + re.IGNORECASE | re.VERBOSE, +) + +# ALTER TABLE [schema.]name ADD [COLUMN] colname coltype +_ALTER_ADD = re.compile( + r"""ALTER\s+TABLE\s+(?P<name>[`"\[]?[\w.]+[`"\]]?)\s+ + ADD\s+(?:COLUMN\s+)?(?P<col>[`"\[]?\w+[`"\]]?)""", + re.IGNORECASE | re.VERBOSE, +) + +# Names of table-level constraints that are not columns.
+_CONSTRAINT_KW = { + "primary", + "foreign", + "unique", + "constraint", + "check", + "key", + "index", +} + + +def _clean(name: str) -> str: + """Strip quoting/backticks/brackets and a schema prefix off an identifier.""" + name = name.strip().strip('`"[]') + return name.split(".")[-1] + + +def _split_columns(body: str) -> list[str]: + """Split the parenthesised body of a CREATE TABLE into top-level column + definitions, respecting nested parens (e.g. NUMERIC(10, 2)).""" + parts: list[str] = [] + depth = 0 + current: list[str] = [] + for ch in body: + if ch == "(": + depth += 1 + current.append(ch) + elif ch == ")": + depth -= 1 + current.append(ch) + elif ch == "," and depth == 0: + parts.append("".join(current).strip()) + current = [] + else: + current.append(ch) + if current: + parts.append("".join(current).strip()) + return [p for p in parts if p] + + +def _column_names(body: str) -> list[str]: + """Pull column names out of a CREATE TABLE body, skipping constraints.""" + cols: list[str] = [] + for part in _split_columns(body): + first = part.split(None, 1) + if not first: + continue + token = first[0] + if token.strip('`"[]').lower() in _CONSTRAINT_KW: + continue + name = _clean(token) + if name and re.match(r"^\w+$", name): + cols.append(name) + return cols + + +def _matching_paren(text: str, open_idx: int) -> int: + """Index of the ) that closes the ( at *open_idx*, or len(text) on failure.""" + depth = 0 + for i in range(open_idx, len(text)): + if text[i] == "(": + depth += 1 + elif text[i] == ")": + depth -= 1 + if depth == 0: + return i + return len(text) + + +@register_parser(".sql") +class SqlParser(BaseParser): + """SQL DDL parser. CREATE TABLE -> a `table:<name>` section listing its + columns; ALTER TABLE ...
ADD COLUMN folds extra columns into that table.""" + + lang = "sql" + extensions = [".sql"] + extracts = ["sections"] + description = "SQL DDL (CREATE / ALTER TABLE)" + + def parse(self, path: Path) -> FileIndex: + path_str = str(path) + idx = FileIndex(path=path_str, lang=self.lang) + try: + text = path.read_text(encoding="utf-8", errors="replace") + except OSError: + return idx + + # table name -> (start_line, [columns]) + tables: dict[str, tuple[int, list[str]]] = {} + order: list[str] = [] + + for m in _CREATE_TABLE.finditer(text): + name = _clean(m.group("name")) + if not name: + continue + line = text[: m.start()].count("\n") + 1 + open_idx = m.end() - 1 # the "(" matched at the end of the regex + close_idx = _matching_paren(text, open_idx) + body = text[open_idx + 1 : close_idx] + cols = _column_names(body) + if name in tables: + # Same table redefined; merge columns, keep first line. + existing_line, existing_cols = tables[name] + merged = list(dict.fromkeys(existing_cols + cols)) + tables[name] = (existing_line, merged) + else: + tables[name] = (line, cols) + order.append(name) + + for m in _ALTER_ADD.finditer(text): + name = _clean(m.group("name")) + col = _clean(m.group("col")) + if not name or not col: + continue + line = text[: m.start()].count("\n") + 1 + if name in tables: + start_line, cols = tables[name] + if col not in cols: + cols.append(col) + tables[name] = (start_line, cols) + else: + tables[name] = (line, [col]) + order.append(name) + + for name in order: + start_line, cols = tables[name] + preview = "columns: " + ", ".join(cols) if cols else "columns: (none)" + idx.sections.append( + SectionDef( + id=f"{path_str}::table:{name}", + title=f"table:{name}", + level=1, + file_path=path_str, + start_line=start_line, + end_line=start_line, + body_preview=preview[:300], + anchor=name, + ) + ) + + return idx diff --git a/codegraph/server/__init__.py b/codegraph/server/__init__.py index 140b94d..73cacf9 100644 --- a/codegraph/server/__init__.py +++ 
b/codegraph/server/__init__.py @@ -186,16 +186,21 @@ def _short_path(path: str) -> str: # Register tools from sub-modules (must be after mcp = FastMCP) from codegraph.server.tools_arch import register as _register_arch # noqa: E402 from codegraph.server.tools_docs import register as _register_docs # noqa: E402 +from codegraph.server.tools_history import register as _register_history # noqa: E402 from codegraph.server.tools_index import register as _register_index # noqa: E402 +from codegraph.server.tools_insight import register as _register_insight # noqa: E402 from codegraph.server.tools_knowledge import register as _register_knowledge # noqa: E402 from codegraph.server.tools_memory import register as _register_memory # noqa: E402 from codegraph.server.tools_meta import register as _register_meta # noqa: E402 from codegraph.server.tools_plans import register as _register_plans # noqa: E402 from codegraph.server.tools_query import register as _register_query # noqa: E402 +from codegraph.server.tools_tests import register as _register_tests # noqa: E402 from codegraph.server.tools_viz import register as _register_viz # noqa: E402 _register_arch(mcp) # architecture_overview, domain_map, endpoints, use FIRST _register_query(mcp) +_register_insight(mcp) # file_summary, impact_of, path_between, import_cycles +_register_tests(mcp) # tests_for, untested _register_docs(mcp) _register_index(mcp) _register_viz(mcp) @@ -203,6 +208,7 @@ def _short_path(path: str) -> str: _register_memory(mcp) _register_plans(mcp) _register_knowledge(mcp) +_register_history(mcp) # hotspots, who_knows # --------------------------------------------------------------------------- @@ -223,8 +229,12 @@ def main() -> None: ap = argparse.ArgumentParser(description="codegraph MCP server (proxy mode)") ap.add_argument("--root", default=os.getcwd(), help="Repo root (default: CWD)") - ap.add_argument("--watch", action="store_true", help="Request file watcher in the owner") - ap.add_argument("--reindex", 
action="store_true", help="Request a full re-index in the owner") + ap.add_argument( + "--watch", action="store_true", help="Request file watcher in the owner" + ) + ap.add_argument( + "--reindex", action="store_true", help="Request a full re-index in the owner" + ) args = ap.parse_args() _root = Path(args.root).resolve() @@ -262,7 +272,9 @@ def _graceful(signum, _frame): # Start (or reuse) the shared owner if is_owner_alive(_root): port = read_owner_port(_root) - print(f"[codegraph] attaching to existing owner on port {port}", file=sys.stderr) + print( + f"[codegraph] attaching to existing owner on port {port}", file=sys.stderr + ) else: print("[codegraph] no owner running, launching one", file=sys.stderr) port = spawn_owner(_root, watch=args.watch, reindex=args.reindex) @@ -280,7 +292,9 @@ def _graceful(signum, _frame): sys.exit(exit_code) -def owner_main(root: str | None = None, watch: bool = False, reindex: bool = False) -> None: +def owner_main( + root: str | None = None, watch: bool = False, reindex: bool = False +) -> None: """ Backend entrypoint, runs FastMCP over HTTP on a loopback port. Spawned by the proxy via `python -m codegraph _serve_owner`. 
Claude @@ -315,7 +329,9 @@ def owner_main(root: str | None = None, watch: bool = False, reindex: bool = Fal stats = index_repo(_root, verbose=False) print(f"[codegraph owner] done: {stats}", file=sys.stderr, flush=True) except RuntimeError as exc: - print(f"[codegraph owner] reindex skipped: {exc}", file=sys.stderr, flush=True) + print( + f"[codegraph owner] reindex skipped: {exc}", file=sys.stderr, flush=True + ) if watch: from codegraph.state.watcher import start_watcher @@ -323,7 +339,11 @@ def owner_main(root: str | None = None, watch: bool = False, reindex: bool = Fal try: start_watcher(_root) except Exception as exc: - print(f"[codegraph owner] watcher disabled: {exc}", file=sys.stderr, flush=True) + print( + f"[codegraph owner] watcher disabled: {exc}", + file=sys.stderr, + flush=True, + ) # Pick a free port + publish port file + owner pid from codegraph.state.ipc import free_port, owner_pidfile, port_file @@ -361,6 +381,8 @@ def _cleanup(): _atexit.register(_cleanup) # Build auth middleware, rejects any request without the bearer token + import hmac + from starlette.middleware.base import BaseHTTPMiddleware from starlette.responses import JSONResponse from starlette.types import ASGIApp @@ -374,7 +396,9 @@ async def dispatch(self, request, call_next): # Accept any path on 127.0.0.1 with correct bearer header = request.headers.get("authorization", "") expected = f"Bearer {self._token}" - if header != expected: + # Constant-time compare so the loopback port gives no timing oracle + # on the token (this is the system's only auth check). 
+ if not hmac.compare_digest(header.encode("utf-8"), expected.encode("utf-8")): return JSONResponse( {"error": "unauthorized"}, status_code=401, diff --git a/codegraph/server/tools_arch.py b/codegraph/server/tools_arch.py index 1156ede..55c8de9 100644 --- a/codegraph/server/tools_arch.py +++ b/codegraph/server/tools_arch.py @@ -19,22 +19,14 @@ def register(mcp) -> None: import codegraph.server as _srv - from codegraph.analysis.federation import for_each_child_kuzu + from codegraph.analysis.federation import federate_scoped from codegraph.server import _get_conn, _logged_tool - def _query_each_kuzu(query_fn): - """Run query_fn(conn) on parent + each child; return [(scope, payload), …].""" - results: list[tuple[str, list]] = [] - try: - results.append(("parent", query_fn(_get_conn()) or [])) - except Exception: - results.append(("parent", [])) - if _srv._root is not None: - for scoped in for_each_child_kuzu(_srv._root, lambda c, _r: query_fn(c)): - if scoped.error: - continue - results.append((scoped.scope, scoped.payload or [])) - return results + def _query_each(query_fn): + """Run query_fn(conn) on parent + each child; return [(scope, payload), …]. + See federation.federate_scoped (warnings dropped here, callers don't use them).""" + scoped, _warnings = federate_scoped(_get_conn, _srv._root, query_fn) + return scoped @mcp.tool() @_logged_tool @@ -71,13 +63,15 @@ def query(conn): except RuntimeError: return [] - per_scope = _query_each_kuzu(query) + per_scope = _query_each(query) scopes_out: dict[str, dict] = {} from codegraph.analysis.roles import LAYER_ORDER for scope, rows in per_scope: - by_layer: dict[str, dict[str, list[dict]]] = defaultdict(lambda: defaultdict(list)) + by_layer: dict[str, dict[str, list[dict]]] = defaultdict( + lambda: defaultdict(list) + ) for row in rows: layer = row.get("layer") or "other" role = row.get("role") or "other" @@ -95,7 +89,9 @@ def query(conn): layer_dict[role_name] = files[:max_files_per_role] + [ {"path": f"... 
{len(files) - max_files_per_role} more"} ] - ordered = {lyr: dict(by_layer[lyr]) for lyr in LAYER_ORDER if lyr in by_layer} + ordered = { + lyr: dict(by_layer[lyr]) for lyr in LAYER_ORDER if lyr in by_layer + } for lyr, val in by_layer.items(): if lyr not in ordered: ordered[lyr] = dict(val) @@ -147,7 +143,7 @@ def query(conn): ) return out - per_scope = _query_each_kuzu(query) + per_scope = _query_each(query) hits_by_role: dict[str, list[dict]] = defaultdict(list) for scope, rows in per_scope: for row in rows: @@ -156,9 +152,15 @@ def query(conn): for role_name, files in hits_by_role.items(): files.sort(key=lambda e: e["path"]) if len(files) > limit_per_role: - hits_by_role[role_name] = files[:limit_per_role] + [{"path": f"... {len(files) - limit_per_role} more"}] - - total = sum(len(v) for v in hits_by_role.values() if not any("more" in str(e.get("path", "")) for e in v)) + hits_by_role[role_name] = files[:limit_per_role] + [ + {"path": f"... {len(files) - limit_per_role} more"} + ] + + total = sum( + len(v) + for v in hits_by_role.values() + if not any("more" in str(e.get("path", "")) for e in v) + ) return json.dumps( { "keyword": keyword, @@ -189,7 +191,14 @@ def query(conn): try: eps = conn.find_nodes( "Endpoint", - return_fields=["id", "method", "path", "framework", "file_path", "start_line"], + return_fields=[ + "id", + "method", + "path", + "framework", + "file_path", + "start_line", + ], order_by=["path", "method"], ) except RuntimeError: @@ -207,7 +216,7 @@ def query(conn): out.append(ep) return out - per_scope = _query_each_kuzu(query) + per_scope = _query_each(query) grouped: dict[str, list[dict]] = defaultdict(list) for scope, rows in per_scope: for row in rows: diff --git a/codegraph/server/tools_docs.py b/codegraph/server/tools_docs.py index e700b60..cd102d5 100644 --- a/codegraph/server/tools_docs.py +++ b/codegraph/server/tools_docs.py @@ -11,24 +11,17 @@ import json import os + def register(mcp) -> None: """Register documentation tools on the 
given FastMCP instance.""" import codegraph.server as _srv - from codegraph.analysis.federation import for_each_child_kuzu + from codegraph.analysis.federation import federate_scoped from codegraph.server import _get_conn, _logged_tool - def _query_each_kuzu(query_fn): - out: list[tuple[str, list]] = [] - try: - out.append(("parent", query_fn(_get_conn()) or [])) - except Exception: - out.append(("parent", [])) - if _srv._root is not None: - for scoped in for_each_child_kuzu(_srv._root, lambda c, _r: query_fn(c)): - if scoped.error: - continue - out.append((scoped.scope, scoped.payload or [])) - return out + def _query_each(query_fn): + """Parent + children fan-out, [(scope, payload), …]. See federate_scoped.""" + scoped, _warnings = federate_scoped(_get_conn, _srv._root, query_fn) + return scoped @mcp.tool() @_logged_tool @@ -43,8 +36,13 @@ def search_docs(query: str, limit: int = 10) -> str: q_str = query # capture before shadowing section_fields = [ - "title", "level", "file_path", "start_line", "end_line", - "body_preview", "anchor", + "title", + "level", + "file_path", + "start_line", + "end_line", + "body_preview", + "anchor", ] def _format(row): @@ -86,11 +84,13 @@ def run(conn): return out results: list[dict] = [] - for scope, rows in _query_each_kuzu(run): + for scope, rows in _query_each(run): for row in rows: row["scope"] = scope results.append(row) - return json.dumps({"query": q_str, "results": results, "count": len(results)}, indent=2) + return json.dumps( + {"query": q_str, "results": results, "count": len(results)}, indent=2 + ) @mcp.tool() @_logged_tool @@ -111,7 +111,7 @@ def query(conn): ) outline: list[dict] = [] - for scope, rows in _query_each_kuzu(query): + for scope, rows in _query_each(query): for row in rows: indent = " " * (row["level"] - 1) outline.append( @@ -126,7 +126,9 @@ def query(conn): } ) if not outline: - return json.dumps({"file": file_path, "outline": [], "note": "No sections found"}) + return json.dumps( + {"file": file_path, 
"outline": [], "note": "No sections found"} + ) return json.dumps({"file": file_path, "outline": outline}, indent=2) @mcp.tool() @@ -194,7 +196,7 @@ def query(conn): return out results: list[dict] = [] - for scope, rows in _query_each_kuzu(query): + for scope, rows in _query_each(query): for r in rows: r["scope"] = scope results.append(r) diff --git a/codegraph/server/tools_history.py b/codegraph/server/tools_history.py new file mode 100644 index 0000000..0a6950d --- /dev/null +++ b/codegraph/server/tools_history.py @@ -0,0 +1,197 @@ +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# __creation__ = 2026-06-07 +# __author__ = "jndjama (Joy Ndjama)" +# __copyright__ = "Copyright 2026 ALTIKVA." +# __licence__ = "MIT & CC BY-NC-SA (http://www.altikva.com/licenses/LICENSE-1.0)" +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# Description: Git-history MCP tools. hotspots joins per-file churn (from +# analysis.churn over the parent root) with graph centrality +# (in-degree on the IMPORTS edge) to rank change-risk files. +# who_knows rolls up the top authors of one file from the git +# log. Both read _srv._root at call time and return JSON strings. + +from __future__ import annotations + +import json +import math +import os +import time + +# Cap on the number of files we score / return. +_HOTSPOT_SCAN_CAP = 2000 + + +def register(mcp) -> None: + """Register git-history tools on the given FastMCP instance.""" + import codegraph.server as _srv + from codegraph.analysis import churn as _churn + from codegraph.server import _get_conn, _logged_tool + + def _abs(path: str) -> str: + """Resolve a repo-relative path against the parent root.""" + if not os.path.isabs(path) and _srv._root: + return str(_srv._root / path) + return path + + @mcp.tool() + @_logged_tool + def hotspots(limit: int = 20) -> str: + """ + Change-risk hotspots: files that churn a lot AND are central to the + import graph. 
High-churn code that many files depend on is where a + regression hurts most, so this surfaces refactor / review targets. + + We join two signals per file: + - churn: commit count + recency, from `git log` over the parent + repo (analysis.churn.file_churn, bounded to the last N commits). + - centrality: in-degree, the number of files that import this one, + counted over the IMPORTS edge via the GraphDB protocol. + + Score formula (each term in 0..1, higher is riskier): + commit_term = log1p(commits) / log1p(max_commits) + import_term = log1p(importers) / log1p(max_importers) + recency_term = 1 / (1 + age_days / 30) # ~1 today, ~0.5 at 30d + score = round(100 * (0.45*commit_term + + 0.35*import_term + + 0.20*recency_term), 2) + Churn dominates, centrality is the multiplier that says "and it + matters", recency is a lighter freshness nudge. log1p compresses a + few hot files so they do not crush the scale. + + Args: + limit: how many top files to return (default 20). + + Returns JSON {hotspots: [{file, commits, last_modified, importers, + authors, score}], count, scanned, note}. NOT federated: git churn is + the parent repo's history only. + """ + root = _srv._root + if root is None: + return json.dumps({"hotspots": [], "count": 0, "error": "no repo root"}) + + churn = _churn.file_churn(root) + if not churn: + return json.dumps( + { + "hotspots": [], + "count": 0, + "scanned": 0, + "note": "no git history available (not a git repo or git missing)", + } + ) + + # In-degree over IMPORTS, counted once for the whole graph. Each + # IMPORTS edge is (src File) -> (dst File); the dst gains one importer. 
+ importers: dict[str, int] = {} + try: + conn = _get_conn() + for r in conn.find_neighbors( + "IMPORTS", return_dst=["path"], limit=_HOTSPOT_SCAN_CAP * 20 + ): + dst = r.get("dst_path") + if dst: + importers[dst] = importers.get(dst, 0) + 1 + except Exception: + importers = {} + + # Map churn (repo-relative paths) onto graph paths (absolute) so we + # can attach the importer count. + items = list(churn.items())[:_HOTSPOT_SCAN_CAP] + max_commits = max((e["commits"] for _p, e in items), default=1) + max_importers = max(importers.values(), default=1) if importers else 1 + now = time.time() + log_commits = math.log1p(max_commits) + log_importers = math.log1p(max_importers) + + scored: list[dict] = [] + for rel_path, e in items: + abs_path = _abs(rel_path) + imp = importers.get(abs_path, 0) + commits = e["commits"] + last = e.get("last_modified", 0) or 0 + + commit_term = math.log1p(commits) / log_commits if log_commits else 0.0 + import_term = ( + math.log1p(imp) / log_importers if log_importers and imp else 0.0 + ) + if last > 0: + age_days = max(0.0, (now - last) / 86400.0) + recency_term = 1.0 / (1.0 + age_days / 30.0) + else: + recency_term = 0.0 + score = round( + 100.0 * (0.45 * commit_term + 0.35 * import_term + 0.20 * recency_term), + 2, + ) + + # Authors as a sorted [name, commits] list, top few only. + authors = sorted( + e.get("authors", {}).items(), key=lambda kv: kv[1], reverse=True + )[:5] + scored.append( + { + "file": rel_path, + "commits": commits, + "last_modified": last, + "importers": imp, + "authors": [{"name": n, "commits": c} for n, c in authors], + "score": score, + } + ) + + scored.sort(key=lambda r: r["score"], reverse=True) + top = scored[: max(1, int(limit))] + return json.dumps( + { + "hotspots": top, + "count": len(top), + "scanned": len(items), + "note": ( + f"churn covers the last {_churn.DEFAULT_COMMIT_CAP} commits; " + "score combines commit count (0.45), import in-degree " + "(0.35), and recency (0.20). 
Git history is the parent " + "repo only, not federated." + ), + }, + indent=2, + ) + + @mcp.tool() + @_logged_tool + def who_knows(file_path: str) -> str: + """ + Who knows this file: the top authors by commit count and recency, + rolled up from `git log -- `. Use it to find a reviewer or to + learn who last touched code you are about to change. + + Args: + file_path: repo-relative or absolute path to the file. + + Returns JSON {file, authors: [{name, commits, last_commit}], note}. + last_commit is a unix timestamp (seconds). NOT federated: ownership + is computed from the parent repo's git history. + """ + root = _srv._root + if root is None: + return json.dumps( + {"file": file_path, "authors": [], "error": "no repo root"} + ) + + authors = _churn.file_ownership(root, file_path) + note = ( + "authors ranked by commit count then recency, from the last " + f"{_churn.OWNERSHIP_COMMIT_CAP} commits touching this file" + ) + if not authors: + note = ( + "no git history for this file (not tracked, not a git repo, " + "or git missing)" + ) + return json.dumps( + { + "file": file_path, + "authors": authors, + "note": note, + }, + indent=2, + ) diff --git a/codegraph/server/tools_index.py b/codegraph/server/tools_index.py index 0e270a0..5fc102f 100644 --- a/codegraph/server/tools_index.py +++ b/codegraph/server/tools_index.py @@ -13,6 +13,16 @@ from pathlib import Path +def _within_repo(target: Path, root: Path) -> bool: + """True if ``target`` resolves inside ``root``. Used to keep force_index + from reading arbitrary absolute paths the repo never declared.""" + try: + target.resolve().relative_to(root.resolve()) + return True + except ValueError: + return False + + def _load_config_toml(root: Path) -> tuple[Path, dict]: """Load .codegraph/config.toml. 
Returns (config_path, data).""" import tomllib @@ -76,8 +86,12 @@ def force_index(paths: list[str], confirmed: bool = False) -> str: # Step 1: Preview, collect files that would be indexed preview_files = [] + refused: list[str] = [] for p in paths: target = Path(p) if os.path.isabs(p) else (root / p) if root else Path(p) + if root is not None and not _within_repo(target, root): + refused.append(str(target)) + continue if target.is_file(): if is_supported(target): preview_files.append(str(target.relative_to(root) if root else target)) @@ -100,6 +114,7 @@ def force_index(paths: list[str], confirmed: bool = False) -> str: ), "files_to_index": preview_files, "file_count": len(preview_files), + "refused_outside_repo": refused, }, indent=2, ) @@ -112,6 +127,10 @@ def force_index(paths: list[str], confirmed: bool = False) -> str: for p in paths: target = Path(p) if os.path.isabs(p) else (root / p) if root else Path(p) + if root is not None and not _within_repo(target, root): + refused.append(str(target)) + continue + if target.is_file(): try: ok = index_file(target, root, force=True) @@ -146,6 +165,7 @@ def force_index(paths: list[str], confirmed: bool = False) -> str: "indexed": indexed, "skipped": skipped, "errors": errors, + "refused_outside_repo": refused, "indexed_count": len(indexed), }, indent=2, @@ -342,7 +362,12 @@ def index_changed_files(since: str = "HEAD~1") -> str: if since == "staged": cmd = ["git", "diff", "--cached", "--name-only", "--diff-filter=ACMR"] else: - cmd = ["git", "diff", "--name-only", "--diff-filter=ACMR", since] + # Reject a leading dash so a value like "--output=/path" can't be + # parsed as a git flag (argument injection via the MCP arg). The + # trailing "--" keeps the ref from being read as a pathspec. 
+ if since.startswith("-"): + return json.dumps({"error": f"invalid git ref: {since!r}"}) + cmd = ["git", "diff", "--name-only", "--diff-filter=ACMR", since, "--"] try: result = subprocess.run( diff --git a/codegraph/server/tools_insight.py b/codegraph/server/tools_insight.py new file mode 100644 index 0000000..db90e11 --- /dev/null +++ b/codegraph/server/tools_insight.py @@ -0,0 +1,642 @@ +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# __creation__ = 2026-06-07 +# __author__ = "jndjama (Joy Ndjama)" +# __copyright__ = "Copyright 2026 ALTIKVA." +# __licence__ = "MIT & CC BY-NC-SA (http://www.altikva.com/licenses/LICENSE-1.0)" +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# Description: Read-only graph-insight MCP tools built over the GraphDB +# protocol: file_summary (one-shot file orientation), +# impact_of (reverse blast radius over CALLS / IMPORTS), +# path_between (shortest path over an edge type), and +# import_cycles (SCC detection on the IMPORTS graph). All are +# federated across subrepos and return JSON strings. + +from __future__ import annotations + +import json +import os + +# Hard caps so a pathological graph never blows up the JSON response. +_SYMBOL_CAP = 200 +_IMPACT_CAP = 300 +_FANOUT_CAP = 500 +_PATH_VISIT_CAP = 5000 + +# Reverse CALLS reach over-counts because CALLS edges are name-matched +# best-effort (same caveat find_dead_code carries). Keep the wording in +# one place so the note stays consistent. +_CALLS_NOTE = ( + "CALLS edges are name-matched best-effort, so reverse reach may " + "over-count: a caller listed here may resolve to a same-named symbol " + "in another file. Treat as a candidate set, not ground truth." 
+) + + +def register(mcp) -> None: + """Register graph-insight tools on the given FastMCP instance.""" + import codegraph.server as _srv + from codegraph.analysis.federation import federate_flat, for_each_child_graphdb + from codegraph.server import _get_conn, _logged_tool + + def _federate(query_fn): + """Parent + federated children fan-out, flattened. Returns + (results_with_scope, warnings). See federation.federate_flat.""" + return federate_flat(_get_conn, _srv._root, query_fn) + + def _abs(path: str) -> str: + """Resolve a repo-relative path against the parent root.""" + if not os.path.isabs(path) and _srv._root: + return str(_srv._root / path) + return path + + def _looks_like_path(arg: str) -> bool: + """Heuristic: does this argument name a file rather than a symbol?""" + if "/" in arg or "\\" in arg: + return True + _, ext = os.path.splitext(arg) + return ext in { + ".py", + ".ts", + ".tsx", + ".js", + ".jsx", + ".vue", + ".go", + ".rs", + ".java", + ".tf", + ".md", + } + + @mcp.tool() + @_logged_tool + def file_summary(file_path: str) -> str: + """ + One-shot orientation for a single file. Returns its role / layer / + lang / module_doc, the functions and classes it defines (name, line + range, docstring head), the modules it imports, and the files that + import it. Use this BEFORE reading a file to decide which line + ranges actually matter. + + Args: + file_path: repo-relative or absolute path to the File node. + + Federated: the file may live in the parent or in any subrepo, so we + query all scopes and aggregate. Each symbol / import row carries a + `scope` tag. Symbols are capped at 200 with a truncation note. + """ + target = _abs(file_path) + + def query(conn): + rows: list[dict] = [] + # File node metadata. There is at most one per scope. 
+ for f in conn.find_nodes( + "File", + where={"path": target}, + return_fields=["path", "lang", "role", "layer", "module_doc"], + ): + rows.append( + { + "_kind": "file", + "path": f["path"], + "lang": f.get("lang") or "", + "role": f.get("role") or "", + "layer": f.get("layer") or "", + "module_doc": (f.get("module_doc") or "")[:300], + } + ) + for label, kind in [("Function", "function"), ("Class", "class")]: + for s in conn.find_nodes( + label, + where={"file_path": target}, + return_fields=["name", "start_line", "end_line", "docstring"], + order_by=["start_line"], + limit=_SYMBOL_CAP + 1, + ): + rows.append( + { + "_kind": "symbol", + "symbol_kind": kind, + "name": s["name"], + "lines": f"{s['start_line']}-{s['end_line']}", + "doc": (s.get("docstring") or "")[:100], + } + ) + for imp in conn.find_neighbors( + "IMPORTS", src_key=target, return_dst=["path"] + ): + rows.append({"_kind": "import", "module": imp["dst_path"]}) + for imp in conn.find_neighbors( + "IMPORTS", dst_key=target, return_src=["path"] + ): + rows.append({"_kind": "imported_by", "file": imp["src_path"]}) + return rows + + results, warnings = _federate(query) + + meta = {"role": "", "layer": "", "lang": "", "module_doc": ""} + functions: list[dict] = [] + classes: list[dict] = [] + imports: list[dict] = [] + imported_by: list[dict] = [] + found = False + for row in results: + kind = row.pop("_kind") + scope = row.get("scope", "parent") + if kind == "file": + found = True + for k in ("role", "layer", "lang", "module_doc"): + if not meta[k] and row.get(k): + meta[k] = row[k] + elif kind == "symbol": + bucket = functions if row["symbol_kind"] == "function" else classes + bucket.append( + { + "name": row["name"], + "lines": row["lines"], + "doc": row["doc"], + "scope": scope, + } + ) + elif kind == "import": + imports.append({"module": row["module"], "scope": scope}) + elif kind == "imported_by": + imported_by.append({"file": row["file"], "scope": scope}) + + total_symbols = len(functions) + 
len(classes) + truncated = total_symbols > _SYMBOL_CAP + if truncated: + # Keep functions first, then fill the remaining room with classes. + keep_fn = min(len(functions), _SYMBOL_CAP) + functions = functions[:keep_fn] + classes = classes[: max(0, _SYMBOL_CAP - keep_fn)] + + payload: dict = { + "file": target, + "found": found, + "role": meta["role"], + "layer": meta["layer"], + "lang": meta["lang"], + "module_doc": meta["module_doc"], + "functions": functions, + "classes": classes, + "imports": imports, + "imported_by": imported_by, + "truncated": truncated, + } + if truncated: + payload["note"] = f"symbols capped at {_SYMBOL_CAP}" + if warnings: + payload["partial"] = True + payload["warnings"] = warnings + return json.dumps(payload, indent=2) + + @mcp.tool() + @_logged_tool + def impact_of(symbol_or_file: str, max_depth: int = 3) -> str: + """ + Reverse blast radius: what depends on a symbol or file. If the + argument looks like a path (or matches a File node) we walk IMPORTS + backward to find every file that transitively imports it. Otherwise + we resolve it as a function name and walk CALLS backward to find + every transitive caller. + + Args: + symbol_or_file: a function name, or a repo-relative / absolute path. + max_depth: how many hops of reverse reach (default 3). + + Returns JSON with `direction` ("callers" or "importers"), the + `impacted` list, `by_role` / `by_layer` counts, any reaching + `endpoints`, a `count`, and `truncated`. + + Federated per scope. CALLS reverse reach over-counts (see `note`): + edges are name-matched, so a listed caller may belong to a + same-named symbol elsewhere. + """ + arg = symbol_or_file + as_path = _looks_like_path(arg) + + def _resolve_is_file(conn) -> bool: + hits = conn.find_nodes("File", where={"path": _abs(arg)}, limit=1) + return bool(hits) + + # Decide direction once, against the parent conn, then reuse it for + # every scope so the result set is homogeneous. 
+ if not as_path: + try: + as_path = _resolve_is_file(_get_conn()) + except Exception: + as_path = False + + direction = "importers" if as_path else "callers" + edge = "IMPORTS" if as_path else "CALLS" + + def reverse_bfs(conn, start_keys: list[str]) -> tuple[list[str], bool]: + """Bounded reverse BFS: collect source keys reachable into the + start keys within max_depth hops. Returns (keys, truncated).""" + seen: set[str] = set(start_keys) + frontier = list(start_keys) + ordered: list[str] = [] + truncated = False + depth = 0 + while frontier and depth < max(1, int(max_depth)): + depth += 1 + next_frontier: list[str] = [] + for key in frontier: + if as_path: + rows = conn.find_neighbors( + edge, dst_key=key, return_src=["path"], limit=_FANOUT_CAP + ) + srcs = [r["src_path"] for r in rows] + else: + rows = conn.find_neighbors( + edge, dst_key=key, return_src=["id"], limit=_FANOUT_CAP + ) + srcs = [r["src_id"] for r in rows] + if len(rows) >= _FANOUT_CAP: + truncated = True + for s in srcs: + if s in seen: + continue + seen.add(s) + ordered.append(s) + next_frontier.append(s) + if len(ordered) >= _IMPACT_CAP: + return ordered, True + frontier = next_frontier + return ordered, truncated + + def query(conn): + # Resolve the starting key(s) within this scope. + if as_path: + start_keys = [_abs(arg)] + else: + start_keys = [ + r["id"] + for r in conn.find_nodes( + "Function", where={"name": arg}, return_fields=["id"] + ) + ] + if not start_keys: + return [] + keys, trunc = reverse_bfs(conn, start_keys) + + out: list[dict] = [] + if as_path: + # Each impacted key is a file path; enrich with role / layer. 
+ for path in keys: + role = layer = lang = "" + fnodes = conn.find_nodes( + "File", + where={"path": path}, + return_fields=["role", "layer", "lang"], + limit=1, + ) + if fnodes: + role = fnodes[0].get("role") or "" + layer = fnodes[0].get("layer") or "" + lang = fnodes[0].get("lang") or "" + out.append( + { + "node": path, + "node_kind": "file", + "role": role, + "layer": layer, + "lang": lang, + "_trunc": trunc, + } + ) + else: + # Each impacted key is a Function id "file::name". + for fid in keys: + file_path = fid.rsplit("::", 1)[0] if "::" in fid else "" + role = layer = "" + if file_path: + fnodes = conn.find_nodes( + "File", + where={"path": file_path}, + return_fields=["role", "layer"], + limit=1, + ) + if fnodes: + role = fnodes[0].get("role") or "" + layer = fnodes[0].get("layer") or "" + out.append( + { + "node": fid, + "node_kind": "function", + "file": file_path, + "role": role, + "layer": layer, + "_trunc": trunc, + } + ) + return out + + results, warnings = _federate(query) + + impacted: list[dict] = [] + by_role: dict[str, int] = {} + by_layer: dict[str, int] = {} + endpoints: list[dict] = [] + truncated = len(results) > _IMPACT_CAP + files_for_endpoints: set[tuple[str, str]] = set() + for row in results: + if row.pop("_trunc", False): + truncated = True + scope = row.get("scope", "parent") + role = row.get("role") or "" + layer = row.get("layer") or "" + if role: + by_role[role] = by_role.get(role, 0) + 1 + if layer: + by_layer[layer] = by_layer.get(layer, 0) + 1 + impacted.append(row) + fp = row.get("file") or (row["node"] if row["node_kind"] == "file" else "") + if fp: + files_for_endpoints.add((scope, fp)) + if role and ("router" in role.lower() or "endpoint" in role.lower()): + endpoints.append( + {"file": fp or row["node"], "role": role, "scope": scope} + ) + + if len(impacted) > _IMPACT_CAP: + impacted = impacted[:_IMPACT_CAP] + truncated = True + + # Endpoints declared in any impacted file (DEFINES_ENDPOINT). 
+ def endpoint_query(conn): + rows: list[dict] = [] + for _scope, fp in files_for_endpoints: + for e in conn.find_neighbors( + "DEFINES_ENDPOINT", + src_key=fp, + return_dst=["method", "path"], + ): + rows.append( + { + "file": fp, + "method": e.get("dst_method", ""), + "path": e.get("dst_path", ""), + } + ) + return rows + + if files_for_endpoints: + ep_rows, ep_warnings = _federate(endpoint_query) + warnings = warnings + ep_warnings + seen_ep = {(e["file"], e.get("path", "")) for e in endpoints} + for e in ep_rows: + key = (e["file"], e.get("path", "")) + if key in seen_ep: + continue + seen_ep.add(key) + endpoints.append(e) + + payload: dict = { + "target": arg, + "direction": direction, + "depth": int(max_depth), + "impacted": impacted, + "count": len(impacted), + "by_role": by_role, + "by_layer": by_layer, + "endpoints": endpoints, + "truncated": truncated, + } + if not as_path: + payload["note"] = _CALLS_NOTE + if warnings: + payload["partial"] = True + payload["warnings"] = warnings + return json.dumps(payload, indent=2) + + @mcp.tool() + @_logged_tool + def path_between(src: str, dst: str, edge: str = "CALLS") -> str: + """ + Shortest path between two symbols or files over an edge type. With + edge="CALLS" (default) src / dst are function names; with + edge="IMPORTS" they are file paths. Runs a forward BFS from src and + reconstructs the first path that reaches dst. + + Args: + src: start function name (CALLS) or file path (IMPORTS). + dst: end function name (CALLS) or file path (IMPORTS). + edge: "CALLS" or "IMPORTS" (default "CALLS"). + + Returns JSON `{src, dst, edge, path: [...], length}` or + `{found: false}`. Per scope: a path is reported from the first scope + that contains one, so it never crosses repo boundaries. 
+ """ + edge = (edge or "CALLS").upper() + if edge not in {"CALLS", "IMPORTS"}: + return json.dumps( + {"found": False, "error": f"unsupported edge type: {edge}"} + ) + is_calls = edge == "CALLS" + + def query(conn): + # Resolve start / end keys for this scope. + if is_calls: + start_ids = [ + r["id"] + for r in conn.find_nodes( + "Function", where={"name": src}, return_fields=["id"] + ) + ] + dst_ids = { + r["id"] + for r in conn.find_nodes( + "Function", where={"name": dst}, return_fields=["id"] + ) + } + ret = ["id"] + ret_key = "dst_id" + else: + start_ids = [_abs(src)] + dst_ids = {_abs(dst)} + ret = ["path"] + ret_key = "dst_path" + if not start_ids or not dst_ids: + return None + + # Forward BFS with parent pointers for path reconstruction. + visited: set[str] = set(start_ids) + parent: dict[str, str | None] = {k: None for k in start_ids} + frontier = list(start_ids) + hit: str | None = None + for s in start_ids: + if s in dst_ids: + hit = s + break + while frontier and hit is None and len(visited) < _PATH_VISIT_CAP: + next_frontier: list[str] = [] + for cur in frontier: + rows = conn.find_neighbors(edge, src_key=cur, return_dst=ret) + for r in rows: + nxt = r[ret_key] + if nxt in visited: + continue + visited.add(nxt) + parent[nxt] = cur + if nxt in dst_ids: + hit = nxt + break + next_frontier.append(nxt) + if hit is not None: + break + frontier = next_frontier + if hit is None: + return None + # Reconstruct. + chain: list[str] = [] + node: str | None = hit + while node is not None: + chain.append(node) + node = parent.get(node) + chain.reverse() + return chain + + results = None + warnings: list[dict] = [] + # We need per-scope payloads (a path lives in one scope), so use the + # scoped variant directly rather than the flat helper. 
+ try: + parent_chain = query(_get_conn()) + except Exception as exc: + parent_chain = None + warnings.append( + {"scope": "parent", "error": f"{type(exc).__name__}: {exc}"} + ) + if parent_chain: + results = ("parent", parent_chain) + if results is None and _srv._root is not None: + for s in for_each_child_graphdb(_srv._root, lambda c, _r: query(c)): + if s.error: + warnings.append({"scope": s.scope, "error": s.error}) + continue + if s.payload: + results = (s.scope, s.payload) + break + + if results is None: + payload: dict = {"src": src, "dst": dst, "edge": edge, "found": False} + if warnings: + payload["partial"] = True + payload["warnings"] = warnings + return json.dumps(payload, indent=2) + + scope, chain = results + payload = { + "src": src, + "dst": dst, + "edge": edge, + "found": True, + "scope": scope, + "path": chain, + "length": len(chain) - 1, + } + if warnings: + payload["partial"] = True + payload["warnings"] = warnings + return json.dumps(payload, indent=2) + + @mcp.tool() + @_logged_tool + def import_cycles(limit: int = 50) -> str: + """ + Detect import cycles in the File->File IMPORTS graph. Builds the + adjacency from every IMPORTS edge and reports each strongly-connected + component of size > 1, plus any file that imports itself. Runs per + scope, so a reported cycle never crosses a repo boundary. + + Args: + limit: cap on the number of cycles returned (default 50). + + Returns JSON `{cycles: [[file, file, ...], ...], count, truncated}`. + Cycles carry no scope tag; the file paths they hold imply the scope. + """ + + def query(conn): + adj: dict[str, list[str]] = {} + for row in conn.find_neighbors( + "IMPORTS", return_src=["path"], return_dst=["path"] + ): + src = row["src_path"] + dst = row["dst_path"] + adj.setdefault(src, []).append(dst) + adj.setdefault(dst, []) + comps = _tarjan_scc(adj) + # Only components that are a real cycle: size > 1, or a self-loop. 
+ cycles: list[dict] = [] + for comp in comps: + if len(comp) > 1: + cycles.append({"cycle": sorted(comp)}) + elif len(comp) == 1: + node = comp[0] + if node in adj.get(node, []): + cycles.append({"cycle": [node]}) + return cycles + + results, warnings = _federate(query) + cycles = [r["cycle"] for r in results] + truncated = len(cycles) > limit + cycles = cycles[:limit] + payload: dict = { + "cycles": cycles, + "count": len(cycles), + "truncated": truncated, + } + if warnings: + payload["partial"] = True + payload["warnings"] = warnings + return json.dumps(payload, indent=2) + + +def _tarjan_scc(adj: dict[str, list[str]]) -> list[list[str]]: + """Tarjan strongly-connected components, iterative to avoid recursion + limits on large import graphs. Returns a list of components (each a list + of node keys).""" + index_counter = [0] + index: dict[str, int] = {} + lowlink: dict[str, int] = {} + on_stack: dict[str, bool] = {} + stack: list[str] = [] + result: list[list[str]] = [] + + for start in list(adj.keys()): + if start in index: + continue + # Iterative DFS. work stack holds (node, neighbor_iterator_position). 
+ work: list[tuple[str, int]] = [(start, 0)] + while work: + node, pi = work[-1] + if pi == 0: + index[node] = lowlink[node] = index_counter[0] + index_counter[0] += 1 + stack.append(node) + on_stack[node] = True + neighbors = adj.get(node, []) + if pi < len(neighbors): + work[-1] = (node, pi + 1) + nxt = neighbors[pi] + if nxt not in index: + work.append((nxt, 0)) + elif on_stack.get(nxt): + lowlink[node] = min(lowlink[node], index[nxt]) + else: + if lowlink[node] == index[node]: + comp: list[str] = [] + while True: + w = stack.pop() + on_stack[w] = False + comp.append(w) + if w == node: + break + result.append(comp) + work.pop() + if work: + parent_node = work[-1][0] + lowlink[parent_node] = min(lowlink[parent_node], lowlink[node]) + return result diff --git a/codegraph/server/tools_meta.py b/codegraph/server/tools_meta.py index bf2990d..8736711 100644 --- a/codegraph/server/tools_meta.py +++ b/codegraph/server/tools_meta.py @@ -34,7 +34,9 @@ def _result_to_dict(r, scope): "kind": r.kind, "name": r.name, "file": r.file_path, - "lines": f"{r.start_line}-{r.end_line}" if r.end_line else str(r.start_line), + "lines": f"{r.start_line}-{r.end_line}" + if r.end_line + else str(r.start_line), "doc": r.docstring, "score": round(r.score, 4), } @@ -52,18 +54,24 @@ def _result_to_dict(r, scope): ) all_results.extend(_result_to_dict(r, "parent") for r in parent_results) except Exception as exc: - warnings.append({"scope": "parent", "error": f"{type(exc).__name__}: {exc}"}) + warnings.append( + {"scope": "parent", "error": f"{type(exc).__name__}: {exc}"} + ) # Children, fresh RO conns if _srv._root is not None: for scoped in for_each_child_fts( _srv._root, - lambda c, _r: _fts(c, query, limit=limit, kind_filter=kind if kind else None), + lambda c, _r: _fts( + c, query, limit=limit, kind_filter=kind if kind else None + ), ): if scoped.error: warnings.append({"scope": scoped.scope, "error": scoped.error}) continue - all_results.extend(_result_to_dict(r, scoped.scope) for r in 
scoped.payload or []) + all_results.extend( + _result_to_dict(r, scoped.scope) for r in scoped.payload or [] + ) # Sort across federation by score (BM25 returns negative, higher abs is better) all_results.sort(key=lambda x: -x["score"]) @@ -91,7 +99,7 @@ def find_dead_code( Treat the results as a per-scope candidate list, not a hard verdict. """ from codegraph.analysis.dead_code import find_dead_code as _find_dead - from codegraph.analysis.federation import for_each_child_kuzu + from codegraph.analysis.federation import for_each_child_graphdb all_dead: list[dict] = [] warnings: list[dict] = [] @@ -114,10 +122,12 @@ def find_dead_code( } ) except Exception as exc: - warnings.append({"scope": "parent", "error": f"{type(exc).__name__}: {exc}"}) + warnings.append( + {"scope": "parent", "error": f"{type(exc).__name__}: {exc}"} + ) if _srv._root is not None: - for scoped in for_each_child_kuzu( + for scoped in for_each_child_graphdb( _srv._root, lambda c, _r: _find_dead( c, @@ -196,7 +206,9 @@ def context_for_task( if session_id and not include_shown: from codegraph.state.activity import log as _activity_log - served_nodes = [("symbol", f"{n.file_path}:{n.start_line}") for n in ctx.nodes] + served_nodes = [ + ("symbol", f"{n.file_path}:{n.start_line}") for n in ctx.nodes + ] served_mem = [("memory", m.path) for m in ctx.memory_docs] served_plans = [("plan", p.path) for p in ctx.plan_docs] served_know = [("knowledge", str(k.id)) for k in ctx.knowledge_docs] @@ -205,15 +217,32 @@ def context_for_task( unseen = set(filter_unseen(session_id, all_entities, repo_root=_srv._root)) before = len(ctx.nodes) + len(ctx.memory_docs) + len(ctx.plan_docs) - ctx.nodes = [n for n in ctx.nodes if ("symbol", f"{n.file_path}:{n.start_line}") in unseen] - ctx.memory_docs = [m for m in ctx.memory_docs if ("memory", m.path) in unseen] + ctx.nodes = [ + n + for n in ctx.nodes + if ("symbol", f"{n.file_path}:{n.start_line}") in unseen + ] + ctx.memory_docs = [ + m for m in ctx.memory_docs if 
("memory", m.path) in unseen + ] ctx.plan_docs = [p for p in ctx.plan_docs if ("plan", p.path) in unseen] - ctx.knowledge_docs = [k for k in ctx.knowledge_docs if ("knowledge", str(k.id)) in unseen] - - after = len(ctx.nodes) + len(ctx.memory_docs) + len(ctx.plan_docs) + len(ctx.knowledge_docs) + ctx.knowledge_docs = [ + k for k in ctx.knowledge_docs if ("knowledge", str(k.id)) in unseen + ] + + after = ( + len(ctx.nodes) + + len(ctx.memory_docs) + + len(ctx.plan_docs) + + len(ctx.knowledge_docs) + ) if before != after: try: - _activity_log(_srv._root, "session_dedup", f"{session_id} hid {before - after}") + _activity_log( + _srv._root, + "session_dedup", + f"{session_id} hid {before - after}", + ) except Exception: pass @@ -229,7 +258,13 @@ def context_for_task( # Recompute derived fields after filtering ctx.files_referenced = sorted(set(n.file_path for n in ctx.nodes)) - ctx.token_estimate = sum(len(n.name) + len(n.docstring) + len(n.file_path) + 50 for n in ctx.nodes) // 4 + ctx.token_estimate = ( + sum( + len(n.name) + len(n.docstring) + len(n.file_path) + 50 + for n in ctx.nodes + ) + // 4 + ) md = render_context_markdown(ctx) return json.dumps( diff --git a/codegraph/server/tools_query.py b/codegraph/server/tools_query.py index 3e86f5f..11edd01 100644 --- a/codegraph/server/tools_query.py +++ b/codegraph/server/tools_query.py @@ -12,41 +12,17 @@ import json import os + def register(mcp) -> None: """Register query tools on the given FastMCP instance.""" import codegraph.server as _srv - from codegraph.analysis.federation import for_each_child_kuzu + from codegraph.analysis.federation import federate_flat from codegraph.server import _get_conn, _logged_tool - def _federate_kuzu(query_fn): - """ - Run query_fn(conn) against the parent's write conn (in-process) - and each federated subrepo's RO Kuzu DB. Returns: - (results_with_scope, partial_warnings) - Each item in results_with_scope is whatever query_fn returned, with - a "scope" key injected. 
partial_warnings is a list of dicts - {scope, error} for child DBs that couldn't be queried. - """ - all_results: list[dict] = [] - warnings: list[dict] = [] - # Parent, direct hit on the in-process write connection - try: - for item in query_fn(_get_conn()) or []: - item["scope"] = "parent" - all_results.append(item) - except Exception as exc: - warnings.append({"scope": "parent", "error": f"{type(exc).__name__}: {exc}"}) - - # Children, fresh RO conns, errors per child - if _srv._root is not None: - for scoped in for_each_child_kuzu(_srv._root, lambda c, _r: query_fn(c)): - if scoped.error: - warnings.append({"scope": scoped.scope, "error": scoped.error}) - continue - for item in scoped.payload or []: - item["scope"] = scoped.scope - all_results.append(item) - return all_results, warnings + def _federate(query_fn): + """Parent + federated children fan-out, flattened. Returns + (results_with_scope, warnings). See federation.federate_flat.""" + return federate_flat(_get_conn, _srv._root, query_fn) @mcp.tool() @_logged_tool @@ -94,10 +70,14 @@ def pattern_search( case_sensitive=case_sensitive, ) for h in hits: - all_hits.append({"scope": "parent", "file": h.file, "line": h.line, "text": h.text}) - - # Each federated subrepo - for child in resolve_children(_srv._root) if _srv._root else []: + all_hits.append( + {"scope": "parent", "file": h.file, "line": h.line, "text": h.text} + ) + + # Each federated subrepo. Resolve children once and reuse for the cap + # below (this used to read + parse config.toml twice per query). 
+ children = resolve_children(_srv._root) if _srv._root else [] + for child in children: try: child_hits, _ = _search( child, @@ -110,10 +90,17 @@ def pattern_search( except Exception: continue for h in child_hits: - all_hits.append({"scope": child.name, "file": h.file, "line": h.line, "text": h.text}) + all_hits.append( + { + "scope": child.name, + "file": h.file, + "line": h.line, + "text": h.text, + } + ) # Apply max_results across the whole federation as a soft cap - all_hits = all_hits[: max_results * (1 + len(resolve_children(_srv._root)) if _srv._root else 1)] + all_hits = all_hits[: max_results * (1 + len(children))] return json.dumps( { @@ -128,12 +115,15 @@ def pattern_search( @mcp.tool() @_logged_tool - def symbol_lookup(name: str) -> str: + def symbol_lookup(name: str, role: str = "", layer: str = "") -> str: """ Find where a symbol (function, class, TF resource) is defined. Returns file path, line range, type, and docstring snippet, plus a `scope` tag (parent / ) when federation is on. Use this instead of grepping files. + + Optional `role` / `layer` filters keep only definitions whose File + node carries that exact role / layer (empty = no filter). 
""" def query(conn): @@ -181,7 +171,11 @@ def query(conn): "MdSection", contains={"title": name}, return_fields=[ - "file_path", "start_line", "end_line", "body_preview", "anchor", + "file_path", + "start_line", + "end_line", + "body_preview", + "anchor", ], ): out.append( @@ -193,9 +187,25 @@ def query(conn): "anchor": row["anchor"], } ) + if role or layer: + cache: dict[str, tuple[str, str]] = {} + kept = [] + for hit in out: + fp = hit.get("file") + if not fp: + continue + if fp not in cache: + cache[fp] = _file_role_layer(conn, fp) + r, lyr = cache[fp] + if role and r != role: + continue + if layer and lyr != layer: + continue + kept.append(hit) + return kept return out - results, warnings = _federate_kuzu(query) + results, warnings = _federate(query) if not results: payload = {"found": False, "name": name} if warnings: @@ -231,7 +241,7 @@ def query(conn): ) ] - callers, warnings = _federate_kuzu(query) + callers, warnings = _federate(query) out = {"fn": fn_name, "callers": callers} if warnings: out["partial"] = True @@ -259,7 +269,7 @@ def query(conn): ) ] - callees, warnings = _federate_kuzu(query) + callees, warnings = _federate(query) out = {"fn": fn_name, "callees": callees} if warnings: out["partial"] = True @@ -288,20 +298,38 @@ def query(conn): ) ] - imports, warnings = _federate_kuzu(query) + imports, warnings = _federate(query) out = {"file": file_path, "imports": imports} if warnings: out["partial"] = True out["warnings"] = warnings return json.dumps(out, indent=2) + def _file_role_layer(conn, file_path: str) -> tuple[str, str]: + """Return (role, layer) for a File node, ('', '') when unknown.""" + nodes = conn.find_nodes( + "File", + where={"path": file_path}, + return_fields=["role", "layer"], + limit=1, + ) + if not nodes: + return "", "" + return (nodes[0].get("role") or "", nodes[0].get("layer") or "") + @mcp.tool() @_logged_tool - def search_symbols(query: str, limit: int = 20) -> str: + def search_symbols( + query: str, limit: int = 20, role: 
str = "", layer: str = "" + ) -> str: """ Fuzzy search for symbols (functions, classes, TF resources) by name. Uses substring match. Federated, `limit` is per scope, results are concatenated; sort/trim downstream if needed. + + Optional `role` / `layer` filters keep only symbols whose File node + carries that exact role / layer (empty = no filter). Useful to scope + a search to e.g. role="router" or layer="domain". """ def run(conn): @@ -352,10 +380,33 @@ def run(conn): "anchor": row["anchor"], } ) + if role or layer: + # Filter each hit by its File node's role / layer. Cache + # per-file lookups so repeated hits in the same file cost one + # query. Markdown / TF hits without a File node drop out. + cache: dict[str, tuple[str, str]] = {} + kept = [] + for hit in out: + fp = hit.get("file") + if not fp: + continue + if fp not in cache: + cache[fp] = _file_role_layer(conn, fp) + r, lyr = cache[fp] + if role and r != role: + continue + if layer and lyr != layer: + continue + kept.append(hit) + return kept return out - results, warnings = _federate_kuzu(run) + results, warnings = _federate(run) payload = {"query": query, "results": results} + if role: + payload["role"] = role + if layer: + payload["layer"] = layer if warnings: payload["partial"] = True payload["warnings"] = warnings @@ -377,7 +428,11 @@ def subgraph(file_path: str, depth: int = 1) -> str: def run_deps(conn): return [ - {"kind": "depends_on", "file": row["dst_path"], "lang": row["dst_lang"]} + { + "kind": "depends_on", + "file": row["dst_path"], + "lang": row["dst_lang"], + } for row in conn.find_neighbors( "IMPORTS", src_key=file_path, return_dst=["path", "lang"] ) @@ -385,16 +440,22 @@ def run_deps(conn): def run_rdeps(conn): return [ - {"kind": "depended_by", "file": row["src_path"], "lang": row["src_lang"]} + { + "kind": "depended_by", + "file": row["src_path"], + "lang": row["src_lang"], + } for row in conn.find_neighbors( "IMPORTS", dst_key=file_path, return_src=["path", "lang"] ) ] - deps_all, w1 
= _federate_kuzu(run_deps) - rdeps_all, w2 = _federate_kuzu(run_rdeps) + deps_all, w1 = _federate(run_deps) + rdeps_all, w2 = _federate(run_rdeps) depends_on = [{k: v for k, v in d.items() if k != "kind"} for d in deps_all] - depended_by = [{k: v for k, v in d.items() if k != "kind"} for d in rdeps_all] + depended_by = [ + {k: v for k, v in d.items() if k != "kind"} for d in rdeps_all + ] payload = { "file": file_path, "depth": depth, @@ -412,11 +473,14 @@ def run_reach(conn): return [ {"file": row["path"], "lang": row["lang"]} for row in conn.reach_via_edge( - "IMPORTS", file_path, max_depth=int(depth), return_fields=["path", "lang"] + "IMPORTS", + file_path, + max_depth=int(depth), + return_fields=["path", "lang"], ) ] - deps, warnings = _federate_kuzu(run_reach) + deps, warnings = _federate(run_reach) payload = {"file": file_path, "depth": depth, "reachable": deps} if warnings: payload["partial"] = True diff --git a/codegraph/server/tools_tests.py b/codegraph/server/tools_tests.py new file mode 100644 index 0000000..57af84d --- /dev/null +++ b/codegraph/server/tools_tests.py @@ -0,0 +1,165 @@ +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# __creation__ = 2026-06-07 +# __author__ = "jndjama (Joy Ndjama)" +# __copyright__ = "Copyright 2026 ALTIKVA." +# __licence__ = "MIT & CC BY-NC-SA (http://www.altikva.com/licenses/LICENSE-1.0)" +# -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# +# Description: Test-to-code mapping MCP tools, computed on the fly from the +# existing IMPORTS / CALLS edges plus File.role. No TESTS edge +# type and no schema change. tests_for(symbol_or_file) surfaces +# the test files that exercise a target; untested(role, layer) +# lists source files no test imports. Both federate across +# parent + subrepos and return JSON strings. 
+ +from __future__ import annotations + +import json +import os + +from codegraph.analysis import impact as _impact + +# Cap untested output so a large repo cannot produce an unbounded list. +_UNTESTED_CAP = 200 + +# Shared caveat: this mapping is inferred from import / call edges, not from +# running a coverage tool. Keep the wording in one place. +_INFER_NOTE = ( + "Inferred from IMPORTS / CALLS edges plus File.role, not from a coverage " + "run. A test counts if it imports the target file (or, for a symbol, calls " + "it). Treat as a heuristic, not ground truth." +) + + +def register(mcp) -> None: + """Register the test-mapping tools on the given FastMCP instance.""" + import codegraph.server as _srv + from codegraph.analysis.federation import federate_flat + from codegraph.server import _get_conn, _logged_tool + + def _federate(query_fn): + """Parent + federated children fan-out, flattened.""" + return federate_flat(_get_conn, _srv._root, query_fn) + + def _abs(path: str) -> str: + """Resolve a repo-relative path against the parent root.""" + if not os.path.isabs(path) and _srv._root: + return str(_srv._root / path) + return path + + @mcp.tool() + @_logged_tool + def tests_for(symbol_or_file: str) -> str: + """ + Find the test files that exercise a target symbol or file. Resolves + the argument to a defining File node, then reports test files (role + `test`) that IMPORTS-> that file, plus, when the target is a symbol, + test files whose functions CALLS-reach it. + + Args: + symbol_or_file: a function / class name, or a repo-relative / + absolute file path. + + Returns JSON `{target, tests: [{file, role, scope}], count, note}`. + This is an inferred import/call heuristic, NOT a coverage tool: see + `note`. Federated across parent + subrepos; each test row carries a + `scope` tag. + """ + # Path-like args resolve against the parent root for the File lookup. 
+ looks_path = ( + "/" in symbol_or_file + or "\\" in symbol_or_file + or (os.path.splitext(symbol_or_file)[1] != "") + ) + arg = _abs(symbol_or_file) if looks_path else symbol_or_file + + def query(conn): + res = _impact.tests_for(conn, arg) + return list(res["tests"]) + + results, warnings = _federate(query) + + seen: set[tuple[str, str]] = set() + tests: list[dict] = [] + for row in results: + scope = row.get("scope", "parent") + key = (scope, row.get("file", "")) + if key in seen: + continue + seen.add(key) + tests.append( + { + "file": row.get("file", ""), + "role": row.get("role", ""), + "scope": scope, + } + ) + + payload: dict = { + "target": symbol_or_file, + "tests": tests, + "count": len(tests), + "note": _INFER_NOTE, + } + if warnings: + payload["partial"] = True + payload["warnings"] = warnings + return json.dumps(payload, indent=2) + + @mcp.tool() + @_logged_tool + def untested(role: str = "", layer: str = "") -> str: + """ + List non-test source files that NO test file imports. Optionally + filter by File.role (e.g. "service", "router") or File.layer (e.g. + "application", "domain"). Test and doc files are never reported. + + Args: + role: optional File.role filter (exact match). + layer: optional File.layer filter (exact match). + + Returns JSON `{untested: [{file, role, layer, scope}], count, note}`, + capped at 200 with a truncation note. Inferred from import edges, not + coverage (see `note`). Federated across parent + subrepos. 
+ """ + + def query(conn): + rows, _trunc = _impact.untested_files( + conn, role=role, layer=layer, cap=_UNTESTED_CAP + ) + if _trunc and rows: + rows[-1] = {**rows[-1], "_trunc": True} + return rows + + results, warnings = _federate(query) + + truncated = False + untested_rows: list[dict] = [] + for row in results: + if row.pop("_trunc", False): + truncated = True + scope = row.get("scope", "parent") + untested_rows.append( + { + "file": row.get("file", ""), + "role": row.get("role", ""), + "layer": row.get("layer", ""), + "scope": scope, + } + ) + + if len(untested_rows) > _UNTESTED_CAP: + untested_rows = untested_rows[:_UNTESTED_CAP] + truncated = True + + payload: dict = { + "untested": untested_rows, + "count": len(untested_rows), + "note": _INFER_NOTE, + } + if truncated: + payload["truncated"] = True + payload["truncation_note"] = f"capped at {_UNTESTED_CAP} files per scope" + if warnings: + payload["partial"] = True + payload["warnings"] = warnings + return json.dumps(payload, indent=2) diff --git a/codegraph/server/tools_viz.py b/codegraph/server/tools_viz.py index 7073d55..485f81a 100644 --- a/codegraph/server/tools_viz.py +++ b/codegraph/server/tools_viz.py @@ -42,7 +42,10 @@ def _viz_file_imports(conn, file_path: str, max_nodes: int, fmt: str) -> str: return_src=["path"], return_dst=["path"], ) - rows = [{"src": r["src_path"], "tgt": r["dst_path"]} for r in outgoing + incoming] + rows = [ + {"src": r["src_path"], "tgt": r["dst_path"]} + for r in outgoing + incoming + ] else: rows = [ {"src": r["src_path"], "tgt": r["dst_path"]} @@ -106,7 +109,9 @@ def _viz_call_graph(conn, symbol_name: str, max_nodes: int, fmt: str) -> str: edges = edges_a + edges_b else: edges = conn.find_neighbors( - "CALLS", **return_args, limit=max_nodes * 2, + "CALLS", + **return_args, + limit=max_nodes * 2, ) rows = [ @@ -165,7 +170,9 @@ def _viz_class_hierarchy(conn, symbol_name: str, max_nodes: int, fmt: str) -> st edges = edges_a + edges_b else: edges = conn.find_neighbors( - 
"INHERITS", **return_args, limit=max_nodes, + "INHERITS", + **return_args, + limit=max_nodes, ) rows = [ @@ -234,7 +241,9 @@ def _viz_file_symbols(conn, file_path: str, fmt: str) -> str: lines = ["graph TD", f' {file_id}["{short}"]:::file'] for cls in classes: cls_id = _safe_id(f"cls_{cls['name']}") - lines.append(f' {cls_id}["{cls["name"]} (L{cls["start_line"]})"]:::class') + lines.append( + f' {cls_id}["{cls["name"]} (L{cls["start_line"]})"]:::class' + ) lines.append(f" {file_id} --> {cls_id}") for fn in fns: fn_id = _safe_id(f"fn_{fn['name']}_{fn['start_line']}") @@ -243,7 +252,9 @@ def _viz_file_symbols(conn, file_path: str, fmt: str) -> str: for sec in sections: sec_id = _safe_id(f"sec_{sec['title']}_{sec['start_line']}") prefix = "#" * sec["level"] - lines.append(f' {sec_id}["{prefix} {sec["title"]} L{sec["start_line"]}"]:::doc') + lines.append( + f' {sec_id}["{prefix} {sec["title"]} L{sec["start_line"]}"]:::doc' + ) lines.append(f" {file_id} --> {sec_id}") lines.append(" classDef file fill:#e1f5fe,stroke:#0288d1") lines.append(" classDef class fill:#fff3e0,stroke:#f57c00") @@ -251,9 +262,15 @@ def _viz_file_symbols(conn, file_path: str, fmt: str) -> str: lines.append(" classDef doc fill:#fce4ec,stroke:#c62828") return "\n".join(lines) else: - lines = ["digraph file_symbols {", " rankdir=TD;", f' "{short}" [shape=folder];'] + lines = [ + "digraph file_symbols {", + " rankdir=TD;", + f' "{short}" [shape=folder];', + ] for cls in classes: - lines.append(f' "{cls["name"]}" [shape=box,style=filled,fillcolor=lightyellow];') + lines.append( + f' "{cls["name"]}" [shape=box,style=filled,fillcolor=lightyellow];' + ) lines.append(f' "{short}" -> "{cls["name"]}";') for fn in fns: lines.append(f' "{fn["name"]}" [shape=ellipse];') @@ -298,7 +315,9 @@ def _viz_doc_structure(conn, file_path: str, max_nodes: int, fmt: str) -> str: for sec in secs: sec_id = _safe_id(f"s_{sec['start_line']}_{fp}") prefix = "#" * sec["level"] - lines.append(f' {sec_id}["{prefix} 
{sec["title"]}"]:::h{min(sec["level"], 3)}') + lines.append( + f' {sec_id}["{prefix} {sec["title"]}"]:::h{min(sec["level"], 3)}' + ) parent_id = None for lvl in range(sec["level"] - 1, 0, -1): if lvl in prev_by_level: @@ -343,9 +362,7 @@ def _viz_full_overview(conn, max_nodes: int, fmt: str) -> str: # Top files by symbol density: count DEFINES_FN edges per file. # find_neighbors gives us (file_path, function_id) pairs; tally. - defines = conn.find_neighbors( - "DEFINES_FN", return_src=["path", "lang"] - ) + defines = conn.find_neighbors("DEFINES_FN", return_src=["path", "lang"]) fn_per_file: dict[tuple[str, str | None], int] = {} for r in defines: key = (r["src_path"], r.get("src_lang")) @@ -360,14 +377,20 @@ def _viz_full_overview(conn, max_nodes: int, fmt: str) -> str: if fmt == "mermaid": lines = ["graph TD"] - lines.append(f' REPO["{_srv._root.name if _srv._root else "repo"}"]:::repo') + lines.append( + f' REPO["{_srv._root.name if _srv._root else "repo"}"]:::repo' + ) for ls in lang_stats: lang_id = _safe_id(ls["lang"] or "unknown") - lines.append(f' {lang_id}["{ls["lang"] or "other"}: {ls["cnt"]} files"]:::lang') + lines.append( + f' {lang_id}["{ls["lang"] or "other"}: {ls["cnt"]} files"]:::lang' + ) lines.append(f" REPO --> {lang_id}") - lines.append(f' STATS["Functions: {fn_count} | Classes: {cls_count} | Doc sections: {md_count}"]:::stats') + lines.append( + f' STATS["Functions: {fn_count} | Classes: {cls_count} | Doc sections: {md_count}"]:::stats' + ) lines.append(" REPO --> STATS") for tf in top_files[:10]: @@ -377,7 +400,9 @@ def _viz_full_overview(conn, max_nodes: int, fmt: str) -> str: lines.append(f' {tf_id}["{short} ({tf["fn_count"]} fns)"]:::hotfile') lines.append(f" {lang_id} --> {tf_id}") - lines.append(" classDef repo fill:#1a237e,stroke:#fff,color:#fff,stroke-width:2px") + lines.append( + " classDef repo fill:#1a237e,stroke:#fff,color:#fff,stroke-width:2px" + ) lines.append(" classDef lang fill:#e8eaf6,stroke:#3f51b5") lines.append(" classDef 
stats fill:#f3e5f5,stroke:#7b1fa2") lines.append(" classDef hotfile fill:#fff3e0,stroke:#e65100") @@ -389,11 +414,35 @@ def _viz_full_overview(conn, max_nodes: int, fmt: str) -> str: f' repo [label="{_srv._root.name if _srv._root else "repo"}",shape=box3d];', ] for ls in lang_stats: - lines.append(f' "{ls["lang"]}" [label="{ls["lang"]}: {ls["cnt"]} files"];') + lines.append( + f' "{ls["lang"]}" [label="{ls["lang"]}: {ls["cnt"]} files"];' + ) lines.append(f' repo -> "{ls["lang"]}";') lines.append("}") return "\n".join(lines) + def _viz_layers(conn, fmt: str) -> str: + """Layer-dependency diagram. Reuses the backend-neutral builder in + viz.mermaid so the CLI `cgh graph layers` and this MCP scope render + the same thing. For dot we emit the same layer->layer edges.""" + from codegraph.viz.mermaid import _layer_edge_counts, _layer_sort_key + + if fmt == "mermaid": + from codegraph.viz.mermaid import mermaid_layers + + return mermaid_layers(conn) + + counts = _layer_edge_counts(conn) + if not counts: + return 'digraph layers {\n NO_LAYERS [label="No layer edges"];\n}' + lines = ["digraph layers {", " rankdir=TD;"] + for sl, dl in sorted( + counts, key=lambda e: (_layer_sort_key(e[0]), _layer_sort_key(e[1])) + ): + lines.append(f' "{sl}" -> "{dl}" [label="{counts[(sl, dl)]}"];') + lines.append("}") + return "\n".join(lines) + # ------------------------------------------------------------------- # MCP tool registrations # ------------------------------------------------------------------- @@ -419,6 +468,7 @@ def visualize_graph( - "file_symbols": all symbols defined in a file - "doc_structure": markdown documentation structure - "full_overview": high-level overview of the codebase + - "layers": architectural layer-to-layer dependency graph file_path: filter to a specific file (optional, for file_imports/file_symbols) symbol_name: filter to a specific symbol (optional, for call_graph/class_hierarchy) max_nodes: max nodes to include (default 30) @@ -441,6 +491,8 @@ def 
visualize_graph( diagram = _viz_doc_structure(conn, file_path, max_nodes, format) elif scope == "full_overview": diagram = _viz_full_overview(conn, max_nodes, format) + elif scope == "layers": + diagram = _viz_layers(conn, format) else: return json.dumps({"error": f"Unknown scope: {scope}"}) @@ -468,7 +520,14 @@ def graph_stats() -> str: conn = _get_conn() stats = { label: conn.count_nodes(label) - for label in ("File", "Function", "Class", "TFResource", "TFVar", "MdSection") + for label in ( + "File", + "Function", + "Class", + "TFResource", + "TFVar", + "MdSection", + ) } return json.dumps(stats, indent=2) diff --git a/codegraph/state/auth.py b/codegraph/state/auth.py index 8b39348..97e6ab1 100644 --- a/codegraph/state/auth.py +++ b/codegraph/state/auth.py @@ -6,22 +6,22 @@ # -#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # Description: MCP auth key management: generation, storage, validation. # -# The auth key protects the MCP server from unauthorized access. -# Defense-in-depth for when codegraph moves to HTTP transport. +# The auth key protects the owner's loopback HTTP bridge from other local +# processes. It is the shared secret behind the Bearer-token check. # # Key lifecycle: -# 1. `cgh init` generates the key → .codegraph/auth.key -# 2. `cgh setup` injects it into .mcp.json as CODEGRAPH_AUTH_KEY env var -# 3. Server reads CODEGRAPH_AUTH_KEY on startup and validates requests +# 1. `cgh init` (or the first owner) generates the key -> .codegraph/auth.key, +# mode 0600, gitignored. +# 2. Both the owner and every worker/CLI caller read that file via +# ensure_auth_key() and send `Authorization: Bearer `. +# The file contents are the secret; there is no env-var hand-off. 
from __future__ import annotations -import os import secrets from pathlib import Path AUTH_KEY_FILE = "auth.key" -AUTH_KEY_ENV = "CODEGRAPH_AUTH_KEY" _CODEGRAPH_DIR = ".codegraph" @@ -84,37 +84,3 @@ def ensure_gitignore_has_auth_key(repo_root: str | Path) -> bool: f.write(f"\n# codegraph auth key (never commit)\n{pattern}\n") return True return False - - -def inject_auth_key_into_mcp_json(repo_root: str | Path, key: str) -> bool: - """ - Add CODEGRAPH_AUTH_KEY to the codegraph server env in .mcp.json. - Returns True if the file was modified. - """ - import json - - mcp_path = Path(repo_root) / ".mcp.json" - if not mcp_path.exists(): - return False - - data = json.loads(mcp_path.read_text(encoding="utf-8")) - servers = data.get("mcpServers", {}) - cg_server = servers.get("codegraph") - if cg_server is None: - return False - - env = cg_server.setdefault("env", {}) - if env.get(AUTH_KEY_ENV) == key: - return False # already set - - env[AUTH_KEY_ENV] = key - mcp_path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8") - return True - - -def validate_server_auth_key() -> str | None: - """ - Read the auth key from environment on server startup. - Returns the key if set, None if auth is disabled (no key configured). - """ - return os.environ.get(AUTH_KEY_ENV) diff --git a/codegraph/viz/__init__.py b/codegraph/viz/__init__.py index 9e302e5..39149df 100644 --- a/codegraph/viz/__init__.py +++ b/codegraph/viz/__init__.py @@ -12,6 +12,7 @@ mermaid_classes, mermaid_docs, mermaid_imports, + mermaid_layers, mermaid_overview, ) @@ -20,6 +21,7 @@ "mermaid_classes", "mermaid_docs", "mermaid_imports", + "mermaid_layers", "mermaid_overview", "generate_html", "open_in_browser", diff --git a/codegraph/viz/html.py b/codegraph/viz/html.py index 64c6d0b..0d02d03 100644 --- a/codegraph/viz/html.py +++ b/codegraph/viz/html.py @@ -116,7 +116,9 @@ - +