
feat(parser): add TeX/LaTeX support (.tex/.sty/.cls/.ltx/.latex) (#289) #291

Merged: josephismikhail merged 2 commits into main from feat/latex-parser on Jun 11, 2026

@josephismikhail

Contributor

Closes #289.

Summary

Adds a TeX/LaTeX parser to core-ingestion so .tex, .sty, .cls, .ltx, and .latex files are analyzed instead of ignored.

The only tree-sitter-latex package published on npm is an abandoned 0.0.0 nan-based placeholder that does not build against the current tree-sitter ABI, and the latex-lsp grammar ships no Node binding. Following the issue's documented fallback, this PR adds a hand-rolled single-pass O(n) scanner. That is also the more robust option for TeX: a static grammar cannot know the arities of custom macros, so real grammars emit ERROR nodes on ordinary documents. The scanner mirrors the existing hand-rolled parsers (Markdown/YAML/TOML/…), degrades gracefully on malformed input, and is deterministic.

Extracted

  • Structure: sectioning hierarchy \part..\subparagraph as nested CONTAINS; environments (\begin..\end) chained file → section → environment → content
  • Definitions: \newcommand/\renewcommand/\providecommand/\DeclareRobustCommand, \def/\let, \newenvironment, \newtheorem; \label/\bibitem anchors
  • IMPORTS: \usepackage/\RequirePackage/\documentclass (packages); \input/\include/\subfile/\import (resolved to .tex); \bibliography/\addbibresource (.bib)
  • REFERENCES: \ref/\eqref/\cref/… → matching \label; \cite/\citep/\citet/… → bib keys
  • Skips verbatim/lstlisting/minted bodies, strips comments, ignores the dynamic \csname…\endcsname name primitive
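
To make the nested-CONTAINS idea concrete, here is a minimal sketch of how a line-oriented scan could turn sectioning commands into parent/child edges using a depth stack. It is not the parser added in this PR: `sectionEdges`, `stripComment`, `ContainsEdge`, and `LEVELS` are hypothetical names, and the regex assumes section titles contain no nested braces.

```typescript
// Illustrative sketch only; every name here is hypothetical, not the PR's code.
const LEVELS: string[] = [
  "part", "chapter", "section", "subsection",
  "subsubsection", "paragraph", "subparagraph",
];

interface ContainsEdge { parent: string; child: string; }

// Drop a trailing TeX comment, honoring escapes such as \% and \\.
function stripComment(line: string): string {
  for (let i = 0; i < line.length; i++) {
    if (line[i] === "\\") { i++; continue; } // skip the escaped character
    if (line[i] === "%") return line.slice(0, i);
  }
  return line;
}

function sectionEdges(source: string, fileName: string): ContainsEdge[] {
  const edges: ContainsEdge[] = [];
  // The file is the root container; it sits below every sectioning level.
  const stack = [{ depth: -1, name: fileName }];
  // Assumption: titles have no nested braces; an optional [short title] is skipped.
  const re = /\\(part|chapter|section|subsection|subsubsection|paragraph|subparagraph)\*?\s*(?:\[[^\]]*\])?\s*\{([^{}]*)\}/g;
  for (const line of source.split("\n")) {
    const code = stripComment(line);
    re.lastIndex = 0;
    let m: RegExpExecArray | null;
    while ((m = re.exec(code)) !== null) {
      const depth = LEVELS.indexOf(m[1]);
      const title = m[2].trim();
      // Close every open section at the same or a deeper level.
      while (stack[stack.length - 1].depth >= depth) stack.pop();
      edges.push({ parent: stack[stack.length - 1].name, child: title });
      stack.push({ depth, name: title });
    }
  }
  return edges;
}
```

Because the stack only ever pops back to the nearest shallower level, a `\subsection` that follows a `\section` nests under it, while the next `\section` returns to the file level.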

Drive-by fix

ix-cli's discovery allowlist (SUPPORTED_EXTENSIONS) had drifted out of sync with the canonical EXT_MAP and was missing every recently added parser (css/lua/zig/html/xml/hcl/bash/haskell), so those files were never walked for ix map. This PR resyncs it and adds the LaTeX extensions. ix text --language latex and language inference are wired up too.
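
One way to keep an allowlist from drifting is to derive it from the canonical map instead of maintaining a second list by hand. The sketch below illustrates that idea only. The shape of `EXT_MAP` shown here is an assumption (the real map in core-ingestion may differ), `isSupported` is a hypothetical helper, and this PR may simply have resynced the list by hand.

```typescript
// Assumed shape: extension -> language id. The real EXT_MAP may differ.
const EXT_MAP: Record<string, string> = {
  ".tex": "latex", ".sty": "latex", ".cls": "latex",
  ".ltx": "latex", ".latex": "latex",
  ".css": "css", ".lua": "lua", ".zig": "zig",
};

// Derived rather than hand-written, so a parser added to EXT_MAP
// becomes discoverable without a second edit.
const SUPPORTED_EXTENSIONS: ReadonlySet<string> = new Set(Object.keys(EXT_MAP));

// Hypothetical helper: true when the path's final extension is in the allowlist.
function isSupported(path: string): boolean {
  const dot = path.lastIndexOf(".");
  return dot >= 0 && SUPPORTED_EXTENSIONS.has(path.slice(dot).toLowerCase());
}
```

With this arrangement a file walker would call `isSupported(path)` and never consult a separate list.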

Testing

  • 24 unit tests (queries.latex.test.ts) covering every construct + malformed/determinism/comment/verbatim edge cases.
  • Validated on 3 real repos (a PhD thesis with .cls + multi-file \include, the beamer package, a LaTeX book): 203 files, 0 crashes, 0 nondeterministic parses (every file was parsed twice and the two outputs byte-compared). Spot-checked quality: includes resolve to .tex, \ref lands on the matching \label, beamer yields 2320 macros / 295 sections / 223 imports.
  • core-ingestion and ix-cli typecheck clean.
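
The double-parse determinism check can be sketched as follows. This is not the PR's test harness: `isDeterministic` is a hypothetical helper, `extractLabels` is a toy stand-in for the real parser, and comparing JSON serializations is an assumed way of byte-comparing two results.

```typescript
// Hypothetical helper: parse the same input twice and compare the serialized output.
function isDeterministic<T>(parse: (src: string) => T, src: string): boolean {
  const first = JSON.stringify(parse(src));
  const second = JSON.stringify(parse(src));
  return first === second;
}

// Toy stand-in for the real parser: collects \label{...} keys in document order.
function extractLabels(src: string): string[] {
  const labels: string[] = [];
  const re = /\\label\s*\{([^{}]*)\}/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(src)) !== null) labels.push(m[1]);
  return labels;
}
```

A corpus run would call `isDeterministic` on the contents of every file and count the failures. A parser that leaks state between calls, or that depends on unordered iteration, would fail this check.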

Merge-requirement checklist (#289)

  • Tested on 3 real LaTeX repos (paper-with-includes / package .sty / book with \part/\chapter)
  • No regressions on existing parsers (non-.tex dispatch unchanged; only additive)
  • Unit + smoke tests
  • Deterministic (byte-identical re-parse)
  • Nested sectioning hierarchy
  • \newcommand/\def/\newenvironment definitions as entities
  • \label anchors; \ref/\eqref/\cref → label REFERENCES
  • \usepackage/\RequirePackage/\documentclass IMPORTS
  • \input/\include/\subfile resolved to .tex
  • \cite/\citep/\citet REFERENCES to bib keys
  • Environments with CONTAINS chaining
  • Graceful malformed / unbalanced handling (no crash/hang)
  • ix text returns language: latex
  • ix contains returns members for a .tex file (CONTAINS edges emitted)

josephismikhail and others added 2 commits June 11, 2026 15:04
Adds a hand-rolled LaTeX parser to core-ingestion. There is no maintained
Node tree-sitter-latex binding for this ABI, and TeX's custom-macro arities
are not statically knowable by a grammar (real grammars emit ERROR nodes on
ordinary documents), so a targeted single-pass O(n) scanner is both the
available and the more robust option. It degrades gracefully on malformed
input and is deterministic (byte-identical re-parse).

Extracts:
- Sectioning hierarchy (\part..\subparagraph) as nested CONTAINS
- Definitions: \newcommand/\renewcommand/\providecommand/\DeclareRobustCommand,
  \def/\let, \newenvironment, \newtheorem
- \label / \bibitem anchors
- Environments (\begin..\end) chained file -> section -> environment -> content
- IMPORTS: \usepackage/\RequirePackage/\documentclass (packages);
  \input/\include/\subfile/\import (resolved to .tex); \bibliography (.bib)
- REFERENCES: \ref/\eqref/\cref/... to labels; \cite/\citep/\citet/... to keys
- Skips verbatim/lstlisting/minted bodies; strips comments; ignores the
  dynamic \csname...\endcsname name primitive

Also resyncs ix-cli's discovery allowlist (SUPPORTED_EXTENSIONS) with the
canonical EXT_MAP. It had drifted and was missing every recently added parser
(css/lua/zig/html/xml/hcl/bash/haskell), so those files were never walked for
`ix map`; this restores their discovery and adds the LaTeX extensions.
`ix text --language latex` and language inference wired up too.

Validated on 3 real repos (a PhD thesis with .cls + multi-file \include, the
beamer package, a LaTeX book): 203 files, 0 crashes, 0 nondeterministic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…, ext drift

Review of the LaTeX parser surfaced three real issues, now fixed:

1. Dangling CONTAINS edges (most serious). Content inside an environment was
   parented to the environment node (e.g. `figure`), but environment type names
   are not guaranteed to be unique within a file, and patch-builder resolves a
   CONTAINS edge's *source* by bare name with no container hint. As soon as two
   sections each held a `figure`, every label/def inside them resolved to an
   ambiguous, non-existent `figure` node and its edge dangled, which affects
   every real multi-figure paper. Definitions, labels, environments and
   references now attach to the nearest enclosing SECTION (unique titles → safe
   CONTAINS source); environments remain recorded as section members. Verified
   on the corpus: 0 CONTAINS edges are now sourced from an environment name.

2. File-level definitions were given `container = <fileName>`, producing
   qualified keys like `main.tex.\foo` instead of `\foo` (inconsistent with how
   sections and Markdown headings handle the file level). Now `undefined` at file
   level.

3. Verbatim/listing end detection used an exact `\end{env}` substring and missed
   the legal `\end {verbatim}` (whitespace) form, swallowing the rest of the file.
   Now whitespace-tolerant via an escaped-name regex.
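
A minimal sketch of the whitespace-tolerant end detection described in item 3,
under stated assumptions: `skipVerbatimBody` and `escapeRegExp` are hypothetical
names, and returning the end of input for an unterminated environment is an
assumed fallback, not necessarily what the parser does.

```typescript
// Escape regex metacharacters so names like "verbatim*" match literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the index just past the closing \end{env} (whitespace allowed
// between \end and the brace), or src.length if no terminator is found.
function skipVerbatimBody(src: string, bodyStart: number, env: string): number {
  const end = new RegExp("\\\\end\\s*\\{" + escapeRegExp(env) + "\\}", "g");
  end.lastIndex = bodyStart; // start searching after \begin{env}
  const m = end.exec(src);
  return m ? m.index + m[0].length : src.length;
}
```

Escaping the name matters for starred environments: without it, the `*` in
`verbatim*` would be read as a quantifier instead of a literal character.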

Also collapsed the three drifted copies of the discovery extension allowlist
(ingest.ts / watch.ts / stale.ts) into one shared `supported-extensions.ts`, so
`ix map`, `ix watch`, and stale detection cover the same set and cannot drift
apart again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@josephismikhail josephismikhail merged commit dbb517b into main Jun 11, 2026
26 of 28 checks passed
@josephismikhail josephismikhail deleted the feat/latex-parser branch June 11, 2026 22:34
Development

Successfully merging this pull request may close these issues.

Add TeX/LaTeX parser support (.tex, .sty, .cls)