feat(extraction): add Terraform and OpenTofu language support #706

Open
Javviviii2 wants to merge 1 commit into colbymchenry:main from Javviviii2:feat/terraform-support

Javviviii2 commented Jun 6, 2026

What this changes

Adds Terraform / OpenTofu as a first-class indexed language. .tf, .tfvars, and .tofu files are now parsed into the graph, so codegraph_search, codegraph_callers, codegraph_callees, and codegraph_impact return real results on infrastructure repos instead of nothing.

Today, opening any Terraform monorepo with CodeGraph indexes 0 nodes / 0 edges because there is no language extractor — files are skipped after detection. After this PR, the same repo gets a full symbol graph.

How

Three small pieces:

  1. Grammar — vendors tree-sitter-terraform.wasm (92 KB) built from @tree-sitter-grammars/tree-sitter-hcl@1.2.0 (Apache-2.0). I picked the terraform dialect rather than generic hcl because it targets .tf/.tfvars/.tofu exactly. The wasm is referenced through the existing path.join(__dirname, 'wasm', ...) vendor branch (same as Pascal/Scala/Lua/Luau), so no new npm dependency.

  2. LanguageExtractor (src/extraction/languages/terraform.ts) — HCL's grammar emits every top-level construct as a generic block node, so the extractor drives everything through visitNode, inspecting the first identifier child to decide the block kind:

    | Terraform construct | NodeKind          | Qualified name |
    |---------------------|-------------------|----------------|
    | resource "T" "N"    | class             | T.N            |
    | data "T" "N"        | class             | data.T.N       |
    | module "M"          | module            | module.M       |
    | variable "X"        | variable          | var.X          |
    | output "X"          | variable          | output.X       |
    | provider "P"        | namespace         | provider.P     |
    | locals { k = … }    | constant per attr | local.k        |
    References (var.X, module.M.out, data.T.N.attr, <type>.<name>.attr, local.k) are emitted as unresolved_refs. Built-in heads (each, count, self, path, terraform.workspace) are filtered.

  3. FrameworkResolver (src/resolution/frameworks/terraform.ts) — when the same qualified name (e.g. var.project_id) exists in multiple modules, the resolver prefers:

    1. The candidate in the same directory as the reference site (real Terraform scoping rule).
    2. For module.M.x refs, the directory containing a module "M" declaration.
    3. Closest common-ancestor path; otherwise the generic name matcher.
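The construct-to-name mapping from piece 2 can be sketched roughly as follows. This is a minimal illustration, not the code in src/extraction/languages/terraform.ts: the `NodeKind` union, the `ExtractedSymbol` shape, and the `symbolsForBlock` helper are hypothetical names, and the real extractor works on tree-sitter nodes rather than pre-split labels.

```typescript
// Hypothetical types for illustration; CodeGraph's real NodeKind and
// extractor interfaces may look different.
type NodeKind = "class" | "module" | "variable" | "namespace" | "constant";

interface ExtractedSymbol {
  kind: NodeKind;
  qualifiedName: string;
}

// blockType: the first identifier of a block (e.g. "resource").
// labels:    the quoted labels that follow it (e.g. ["T", "N"]).
// attrKeys:  attribute names inside the block body (only used for locals).
function symbolsForBlock(
  blockType: string,
  labels: string[],
  attrKeys: string[] = [],
): ExtractedSymbol[] {
  const [a, b] = labels;
  switch (blockType) {
    case "resource":
      return a && b ? [{ kind: "class", qualifiedName: `${a}.${b}` }] : [];
    case "data":
      return a && b ? [{ kind: "class", qualifiedName: `data.${a}.${b}` }] : [];
    case "module":
      return a ? [{ kind: "module", qualifiedName: `module.${a}` }] : [];
    case "variable":
      return a ? [{ kind: "variable", qualifiedName: `var.${a}` }] : [];
    case "output":
      return a ? [{ kind: "variable", qualifiedName: `output.${a}` }] : [];
    case "provider":
      return a ? [{ kind: "namespace", qualifiedName: `provider.${a}` }] : [];
    case "locals":
      // One constant per attribute in the locals block.
      return attrKeys.map(
        (k): ExtractedSymbol => ({ kind: "constant", qualifiedName: `local.${k}` }),
      );
    default:
      // Anything else (for example the terraform { } settings block)
      // yields no symbols.
      return [];
  }
}
```

For example, `symbolsForBlock("locals", [], ["region", "zone"])` would yield two constants, `local.region` and `local.zone`.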

Validation

I validated against two real Terraform monorepos. All numbers measured locally on Node 22.22 / Linux.

| Repo     | .tf files | Indexing | Nodes | Edges  |
|----------|-----------|----------|-------|--------|
| medium A | 277       | 1.3 s    | 2 119 | 6 085  |
| medium B | 470       | 2.4 s    | 3 334 | 13 258 |

Post-index query latency on the 470-file repo:

codegraph query   "google_compute"     →  110 ms
codegraph callers var.project_id       →  150 ms
codegraph callees google_service_*     →  120 ms
codegraph impact  var.region -d 2      →  110 ms

Cross-module precision: same-named variables used to bleed across modules with the generic matcher; they are now scoped correctly. Spot-check on var.project_id in repo A, referenced from modules/net-vpc/main.tf:

Before resolver:  8 callers, 3 different modules' variables.tf  (wrong)
After resolver:   5 callers, all from modules/net-vpc/variables.tf  (right)
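The scoping behaviour behind this spot-check can be sketched as below. It is a simplified illustration, not the resolver in src/resolution/frameworks/terraform.ts: the `Candidate` shape and the `pickCandidate` and `sharedDirDepth` helpers are hypothetical, paths are assumed to be repo-relative with `/` separators, and the module.M.x rule is left out. Returning `undefined` stands in for "fall back to the generic name matcher".

```typescript
// Hypothetical candidate shape: one declaration of a qualified name.
interface Candidate {
  qualifiedName: string;
  filePath: string; // repo-relative, "/" separated (assumption)
}

// Directory part of a repo-relative path ("." for files at the root).
const dirOf = (p: string): string =>
  p.includes("/") ? p.slice(0, p.lastIndexOf("/")) : ".";

// Number of leading directory segments two files have in common.
function sharedDirDepth(fileA: string, fileB: string): number {
  const a = dirOf(fileA).split("/");
  const b = dirOf(fileB).split("/");
  let depth = 0;
  while (depth < a.length && depth < b.length && a[depth] === b[depth]) depth++;
  return depth;
}

function pickCandidate(refFile: string, candidates: Candidate[]): Candidate | undefined {
  // Preference 1: a declaration in the same directory as the reference site.
  const refDir = dirOf(refFile);
  const sameDir = candidates.find((c) => dirOf(c.filePath) === refDir);
  if (sameDir) return sameDir;

  // Preference 3: the candidate sharing the deepest common ancestor directory.
  // A tie, or no shared ancestor at all, is treated as "undecided".
  let best: Candidate | undefined;
  let bestDepth = 0;
  let tie = false;
  for (const c of candidates) {
    const d = sharedDirDepth(refFile, c.filePath);
    if (d > bestDepth) {
      best = c;
      bestDepth = d;
      tie = false;
    } else if (d === bestDepth && d > 0) {
      tie = true;
    }
  }
  return tie ? undefined : best;
}
```

With a reference in modules/net-vpc/main.tf and var.project_id declared in modules/net-vpc/variables.tf plus two other (here invented) module directories, the same-directory rule selects the net-vpc declaration, which matches the "after" result above.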

Tests

__tests__/extraction.test.ts: 18 new tests under describe('Terraform Extraction') covering language detection, all six block kinds, locals attribute fan-out, the terraform { } settings block (correctly ignored), reference extraction for var/local/module/data/managed-resource heads, and the built-in skip list.
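As an illustration of what the reference-extraction and skip-list cases exercise, here is a minimal sketch of mapping a dotted traversal to the qualified name it targets. The `referenceTarget` helper is hypothetical and works on plain strings, whereas the actual extractor walks tree-sitter nodes. Mapping module.M.out to module.M is an assumption made for this sketch, and index suffixes are stripped naively, so keys that themselves contain dots are not handled.

```typescript
// Built-in heads that never refer to a user-declared symbol.
const BUILTIN_HEADS = new Set(["each", "count", "self", "path"]);

// Drop an index or key suffix such as "[0]" or "[each.key]" from a segment.
const stripIndex = (segment: string): string => segment.replace(/\[.*$/, "");

// Returns the qualified name a traversal points at, or null when the
// traversal is a built-in or too short to name a symbol.
function referenceTarget(expr: string): string | null {
  const parts = expr.split(".").map(stripIndex);
  const head = parts[0];

  if (BUILTIN_HEADS.has(head)) return null;
  if (head === "terraform" && parts[1] === "workspace") return null;

  switch (head) {
    case "var":
      return parts.length >= 2 ? `var.${parts[1]}` : null;
    case "local":
      return parts.length >= 2 ? `local.${parts[1]}` : null;
    case "module":
      // module.M.out is attributed to the module block itself (assumption).
      return parts.length >= 2 ? `module.${parts[1]}` : null;
    case "data":
      return parts.length >= 3 ? `data.${parts[1]}.${parts[2]}` : null;
    default:
      // Managed resource: <type>.<name>[.attr]
      return parts.length >= 2 ? `${head}.${parts[1]}` : null;
  }
}
```

Under these assumptions, `referenceTarget("data.google_project.current.number")` gives `data.google_project.current`, while `referenceTarget("each.value")` and `referenceTarget("terraform.workspace")` both give `null`.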

Full suite: 1146 passed / 2 skipped, 0 regressions (baseline was 1128 passed / 2 skipped + 1 pre-existing flaky watcher test).

Notes for review

  • No new npm dependencies. The wasm is vendored, same pattern as the existing 4 vendored grammars.
  • No public API changes. Just one new entry in LANGUAGES, one in EXTRACTORS, one in FRAMEWORK_RESOLVERS, one extension mapping.
  • CHANGELOG entry added under ## [Unreleased] → ### New Features, written user-facing per the house rule (no internal paths or symbol names).
  • License attribution for the vendored wasm is recorded in the comment above the vendor branch in grammars.ts (Apache-2.0, source repo and source release noted).
  • I intentionally did not add the formal A/B agent run (the ### Validation methodology section) from docs/design/dynamic-dispatch-coverage-playbook.md, since I don't have access to the small/medium/large benchmark repo set referenced there. Happy to follow up with that as a separate PR if you'd like to add Terraform to the matrix; I can pick a public Terraform repo per size tier.
  • Happy to split this into smaller commits if you prefer.

Thanks for the great project — the architecture made it really clean to extend.

Index .tf, .tfvars, and .tofu files via the tree-sitter-terraform dialect
of HCL (vendored from @tree-sitter-grammars/tree-sitter-hcl, Apache-2.0).

Symbols extracted:
- resource / data  → class  (qualified "type.name" / "data.type.name")
- module           → module (qualified "module.name")
- variable         → variable (qualified "var.name")
- output           → variable (qualified "output.name")
- provider         → namespace
- locals           → constant per attribute (qualified "local.key")

References resolved cross-file:
- var.X, local.X, module.M[.out], data.T.N[.attr], <type>.<name>[.attr]
- built-ins skipped: each.*, count.*, self.*, path.*, terraform.workspace

The Terraform framework resolver disambiguates same-named candidates
across modules by preferring the one in the same directory as the
reference site, then by closest common-ancestor path, falling back to
the generic name matcher only when neither applies.

Validated on two Terraform monorepos (277 and 470 .tf files): indexing
runs in 1.3s and 2.4s respectively, query latency stays under 200ms,
and cross-module references resolve to the correct module 100% of the
time on inspected samples.

18 new extraction tests; full suite 1146/1148 green (2 pre-existing
flaky skips, 0 regressions).