919 changes: 159 additions & 760 deletions graphify/extract.py

Large diffs are not rendered by default.

32 changes: 29 additions & 3 deletions graphify/extractors/MIGRATION.md
@@ -12,12 +12,38 @@ written so an AI agent can execute it in a single session.
 | zig | yes |
 | elixir | yes |
 | razor | yes |
-| (40 more in extract.py) | no |
+| csharp | partial — helpers + cross-file resolver split into `extractors/csharp*.py`; config-driven `extract_csharp` entry stays (see Middle path) |
+| (39 more in extract.py) | no |

 Note: config-driven extractors (python, js, java, c, cpp, ruby, csharp,
 kotlin, scala, php, lua, swift, groovy) depend on the shared
-`_extract_generic` core (~1,300 lines). Do NOT port them one-by-one; the core
-must move first as its own coordinated batch. Pick a bespoke extractor.
+`_extract_generic` core (~1,300 lines). Do NOT move the config-driven
+`extract_<lang>` ENTRY POINT one-by-one; the core must move first as its own
+coordinated batch. Pick a bespoke extractor for a full port.
+
+### Middle path (config-driven helper split — the C# pattern)
+
+Even for a config-driven language you can give it a real module home *before* the
+`_extract_generic` batch: split the language-specific **helpers** (per-file
+binding / type-table / shadow model, type references, imports) and the cross-file
+**member-call resolver** into their own modules — see `extractors/csharp_extract.py`,
+`extractors/csharp_resolve.py`, and `extractors/csharp.py` — facade-re-exported
+from `extract.py`, while the thin `extract_<lang>` → `_extract_generic` entry
+point stays inline. Guardrails:
+
+- **Prove the split is behavior-preserving** with a normalized node/edge snapshot
+over BOTH the single-file `extract_<lang>` and the multi-file `extract` entry
+points (order-preserving — a list-sorting canonical hides a fact-order
+regression).
+- **Lift shared config** (`LanguageConfig`) to `base.py`; keep the import
+direction `extract.py -> extractors/` (never import `graphify.extract` here).
+- Only pull C#-only logic out of `_extract_generic` where the helper **returns
+facts** and the core keeps emission (no `add_node`/`add_edge`/`nodes`/`seen_ids`
+threaded in) — otherwise leave a thin inline hook.
+
+Unlike a verbatim entry-point port, a middle-path split may ship alongside a
+feature (e.g. the C# member-call resolver) with its own tests; the snapshot +
+full suite are the preservation proof.

 ## Invariants (non-negotiable)
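
Review note on the first guardrail in the Middle path section: a minimal sketch of what an order-preserving normalized snapshot could look like. The result shape (`{"nodes": [...], "edges": [...]}` with dict items), the `root` prefix stripping, and both function names are assumptions made for illustration; they are not taken from graphify's test suite.

```python
from __future__ import annotations

import json


def normalize_snapshot(result: dict, root: str = "") -> str:
    """Serialize an extraction result to a stable string, keeping list order.

    Assumed (hypothetical) shape: {"nodes": [dict, ...], "edges": [dict, ...]}.
    """

    def clean(value):
        # Strip a machine-specific path prefix so snapshots are portable.
        if isinstance(value, str) and root and value.startswith(root):
            return value[len(root):].lstrip("/")
        return value

    def norm(item: dict) -> dict:
        return {key: clean(val) for key, val in item.items()}

    payload = {
        "nodes": [norm(n) for n in result.get("nodes", [])],
        "edges": [norm(e) for e in result.get("edges", [])],
    }
    # sort_keys only orders keys inside each dict; the node and edge lists
    # keep their emission order, so a reordering shows up as a diff.
    return json.dumps(payload, sort_keys=True, indent=2)


def sorted_canonical(result: dict) -> str:
    """Counter-example: sorting the lists erases emission order."""
    return json.dumps(
        {
            key: sorted(json.dumps(item, sort_keys=True) for item in result.get(key, []))
            for key in ("nodes", "edges")
        }
    )


before = {"nodes": [{"id": "a"}, {"id": "b"}], "edges": []}
after = {"nodes": [{"id": "b"}, {"id": "a"}], "edges": []}

# The order-preserving snapshot catches the reordering; the sorted one does not.
order_change_detected = normalize_snapshot(before) != normalize_snapshot(after)
order_change_hidden = sorted_canonical(before) == sorted_canonical(after)
```

Under these assumptions, the same helper would be run once over the single-file entry point and once over the multi-file entry point, comparing each string against a snapshot captured before the split.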

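
Review note on the third guardrail ("the helper returns facts and the core keeps emission"): a small sketch of that boundary, assuming a hypothetical `TypeRefFact` record and hypothetical `collect_type_refs` / `emit_edges` functions. None of these names come from the PR; they only illustrate the shape of the contract.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TypeRefFact:
    """Hypothetical fact record produced by a language-specific helper."""
    source_id: str
    target_name: str
    relation: str


def collect_type_refs(owner_id: str, type_names: list[str]) -> list[TypeRefFact]:
    """Helper side: compute facts only.

    It receives no add_node/add_edge callbacks and no shared node or id
    collections, so it cannot change what the core emits or in what order.
    """
    return [TypeRefFact(owner_id, name, "references") for name in type_names]


def emit_edges(facts: list[TypeRefFact], edges: list[dict]) -> None:
    """Core side: the single place where edges are appended."""
    for fact in facts:
        edges.append(
            {"source": fact.source_id, "target": fact.target_name, "relation": fact.relation}
        )


edges: list[dict] = []
emit_edges(collect_type_refs("Foo", ["Bar", "Baz"]), edges)
```

Because the helper is a pure function of its inputs, it can be unit-tested without constructing the generic core, which is presumably why the guardrail prefers this shape over threading graph state into the helper.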
48 changes: 48 additions & 0 deletions graphify/extractors/base.py
@@ -1,7 +1,9 @@
 # DO NOT import from graphify.extract here — direction is extract.py → extractors/ only.
 from __future__ import annotations

+from dataclasses import dataclass
 from pathlib import Path
+from typing import Callable

 from graphify.ids import make_id

@@ -64,3 +66,49 @@ def _file_stem(path: Path) -> str:

 def _read_text(node, source: bytes) -> str:
     return source[node.start_byte:node.end_byte].decode("utf-8", errors="replace")
+
+
+# ── LanguageConfig dataclass ─────────────────────────────────────────────────
+
+@dataclass
+class LanguageConfig:
+    ts_module: str                              # e.g. "tree_sitter_python"
+    ts_language_fn: str = "language"            # attr to call: e.g. tslang.language()
+
+    class_types: frozenset = frozenset()
+    function_types: frozenset = frozenset()
+    import_types: frozenset = frozenset()
+    call_types: frozenset = frozenset()
+    static_prop_types: frozenset = frozenset()
+    helper_fn_names: frozenset = frozenset()
+    container_bind_methods: frozenset = frozenset()
+    event_listener_properties: frozenset = frozenset()
+
+    # Name extraction
+    name_field: str = "name"
+    name_fallback_child_types: tuple = ()
+
+    # Body detection
+    body_field: str = "body"
+    body_fallback_child_types: tuple = ()       # e.g. ("declaration_list", "compound_statement")
+
+    # Call name extraction
+    call_function_field: str = "function"       # field on call node for callee
+    call_accessor_node_types: frozenset = frozenset()  # member/attribute nodes
+    call_accessor_field: str = "attribute"      # field on accessor for method name
+    call_accessor_object_field: str = ""        # field on accessor for the receiver/object
+
+    # Stop recursion at these types in walk_calls
+    function_boundary_types: frozenset = frozenset()
+
+    # Import handler: called for import nodes instead of generic handling
+    import_handler: Callable | None = None
+
+    # Optional custom name resolver for functions (C, C++ declarator unwrapping)
+    resolve_function_name_fn: Callable | None = None
+
+    # Extra label formatting for functions: if True, functions get "name()" label
+    function_label_parens: bool = True
+
+    # Extra walk hook called after generic dispatch (for JS arrow functions, C# namespaces, etc.)
+    extra_walk_fn: Callable | None = None
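
Review note on how a config of this kind might be filled in and consumed. The sketch uses a trimmed stand-in (`MiniLanguageConfig`, hypothetical) so that it runs without graphify installed. The node type names follow the tree-sitter Python grammar. The `resolve_language_factory` helper is an assumption about how `ts_module` and `ts_language_fn` are used, inferred only from the field comments in the diff, and is not the actual `_extract_generic` code.

```python
from __future__ import annotations

import importlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class MiniLanguageConfig:
    """Trimmed, hypothetical stand-in for the LanguageConfig added in base.py."""
    ts_module: str
    ts_language_fn: str = "language"
    class_types: frozenset = frozenset()
    function_types: frozenset = frozenset()
    call_types: frozenset = frozenset()
    call_accessor_node_types: frozenset = frozenset()
    call_accessor_field: str = "attribute"
    call_accessor_object_field: str = ""


# Illustrative values for Python; graphify's real config may differ.
PYTHON_SKETCH = MiniLanguageConfig(
    ts_module="tree_sitter_python",
    class_types=frozenset({"class_definition"}),
    function_types=frozenset({"function_definition"}),
    call_types=frozenset({"call"}),
    call_accessor_node_types=frozenset({"attribute"}),
    call_accessor_field="attribute",
    call_accessor_object_field="object",
)


def resolve_language_factory(cfg: MiniLanguageConfig) -> Callable:
    """Look up the callable named by ts_language_fn on the module named by ts_module.

    Assumed consumption pattern: import lazily so a missing grammar package
    only fails for the language that needs it.
    """
    module = importlib.import_module(cfg.ts_module)
    return getattr(module, cfg.ts_language_fn)
```

Keeping the config as plain data (strings, frozensets, optional callables) is what lets it live in `base.py` without importing `graphify.extract`, which matches the import-direction comment at the top of the file.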