feat(parser): add TeX/LaTeX support (.tex/.sty/.cls/.ltx/.latex) (#289) #291
Merged
Adds a hand-rolled LaTeX parser to core-ingestion. There is no maintained Node
tree-sitter-latex binding for this ABI, and TeX's custom-macro arities are not
statically knowable by a grammar (real grammars emit ERROR nodes on ordinary
documents), so a targeted single-pass O(n) scanner is both the available and
the more robust option. It degrades gracefully on malformed input and is
deterministic (byte-identical re-parse).

Extracts:
- Sectioning hierarchy (\part..\subparagraph) as nested CONTAINS
- Definitions: \newcommand/\renewcommand/\providecommand/\DeclareRobustCommand,
  \def/\let, \newenvironment, \newtheorem
- \label / \bibitem anchors
- Environments (\begin..\end) chained file -> section -> environment -> content
- IMPORTS: \usepackage/\RequirePackage/\documentclass (packages);
  \input/\include/\subfile/\import (resolved to .tex); \bibliography (.bib)
- REFERENCES: \ref/\eqref/\cref/... to labels; \cite/\citep/\citet/... to keys
- Skips verbatim/lstlisting/minted bodies; strips comments; ignores the dynamic
  \csname...\endcsname name primitive

Also resyncs ix-cli's discovery allowlist (SUPPORTED_EXTENSIONS) with the
canonical EXT_MAP. It had drifted and was missing every recently added parser
(css/lua/zig/html/xml/hcl/bash/haskell), so those files were never walked for
`ix map`; this restores their discovery and adds the LaTeX extensions.
`ix text --language latex` and language inference wired up too.

Validated on 3 real repos (a PhD thesis with .cls + multi-file \include, the
beamer package, a LaTeX book): 203 files, 0 crashes, 0 nondeterministic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
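To make the "single-pass O(n) scanner" idea concrete, here is a minimal sketch of how sectioning commands could be turned into a nested hierarchy while comments are skipped. It is not the parser from this PR: `SectionNode`, `LEVELS`, and `scanSections` are hypothetical names, and only sectioning commands are handled.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not the PR's API.
interface SectionNode { level: number; title: string; parent?: string }

const LEVELS = ["part", "chapter", "section", "subsection",
                "subsubsection", "paragraph", "subparagraph"];

function scanSections(src: string): SectionNode[] {
  const out: SectionNode[] = [];
  const stack: SectionNode[] = []; // currently open sections, outermost first
  let i = 0;
  while (i < src.length) {
    if (src[i] === "%") {
      // Comment: skip to end of line.
      while (i < src.length && src[i] !== "\n") i++;
      continue;
    }
    if (src[i] !== "\\") { i++; continue; }
    // Read a control word: backslash followed by letters.
    let j = i + 1;
    while (j < src.length && /[A-Za-z]/.test(src[j])) j++;
    const name = src.slice(i + 1, j);
    if (name === "") { i += 2; continue; } // control symbol such as \% or \\
    i = j;
    const level = LEVELS.indexOf(name);
    if (level < 0) continue;
    if (src[i] === "*") i++; // starred form, e.g. \section*
    while (i < src.length && /\s/.test(src[i])) i++;
    if (src[i] === "[") { // optional short title
      while (i < src.length && src[i] !== "]") i++;
      i++;
    }
    if (src[i] !== "{") continue; // malformed: no title argument, move on
    // Read the title with brace balancing.
    let depth = 0;
    let k = i;
    for (; k < src.length; k++) {
      if (src[k] === "\\") { k++; continue; } // skip escaped character
      if (src[k] === "{") depth++;
      else if (src[k] === "}" && --depth === 0) break;
    }
    if (k >= src.length) break; // unterminated title: stop without throwing
    const title = src.slice(i + 1, k).trim();
    i = k + 1;
    // Close every open section at the same or a deeper level.
    while (stack.length > 0 && stack[stack.length - 1].level >= level) stack.pop();
    const node: SectionNode = { level, title, parent: stack[stack.length - 1]?.title };
    out.push(node);
    stack.push(node);
  }
  return out;
}
```

Each character is visited a bounded number of times and the function has no hidden state, so parsing the same input twice yields the same result, which is the property the double-parse check relies on.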
…, ext drift
Review of the LaTeX parser surfaced three real issues, now fixed:
1. Dangling CONTAINS edges (most serious). Content inside an environment was
parented to the environment node (e.g. `figure`), but environment type names
are not unique within a file, and patch-builder resolves a CONTAINS edge's
*source* by bare name with no container hint. As soon as two sections each
held a `figure`, every label/def inside resolved to an ambiguous, effectively
non-existent `figure` node and its edge dangled. This affects every real
multi-figure paper. Definitions, labels, environments and references now
attach to the nearest enclosing SECTION (unique titles → safe CONTAINS
source); environments remain recorded as section members. Verified on the
corpus: 0 CONTAINS edges are now sourced from an environment name.
2. File-level definitions were given `container = <fileName>`, producing
qualified keys like `main.tex.\foo` instead of `\foo` (inconsistent with how
sections and Markdown headings handle the file level). Now `undefined` at file
level.
3. Verbatim/listing end detection used an exact `\end{env}` substring and missed
the legal `\end {verbatim}` (whitespace) form, swallowing the rest of the file.
Now whitespace-tolerant via an escaped-name regex.
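A minimal sketch of the whitespace-tolerant end detection described in item 3, assuming the environment name is escaped before being placed in a regular expression. `escapeRegExp` and `findVerbatimEnd` are hypothetical helper names, not functions from this PR.

```typescript
// Escape regex metacharacters so names such as "verbatim*" match literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: find the \end of a verbatim-like environment,
// tolerating whitespace between \end and the opening brace.
// Returns where the body ends and where scanning should resume, or null.
function findVerbatimEnd(
  src: string,
  env: string,
  from: number,
): { bodyEnd: number; resume: number } | null {
  const re = new RegExp(`\\\\end\\s*\\{${escapeRegExp(env)}\\}`, "g");
  re.lastIndex = from; // start searching after the \begin{...}
  const m = re.exec(src);
  return m === null ? null : { bodyEnd: m.index, resume: m.index + m[0].length };
}
```

With an exact `indexOf("\\end{verbatim}")`, the input `\end {verbatim}` is never found and the rest of the file is treated as verbatim body; the regex form finds it and lets the scanner resume right after the closing brace.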
Also collapsed the three drifted copies of the discovery extension allowlist
(ingest.ts / watch.ts / stale.ts) into one shared `supported-extensions.ts`, so
`ix map`, `ix watch`, and stale detection cover the same set and cannot drift
apart again.
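One way such a shared module could avoid drift is to derive the allowlist from the canonical map instead of listing extensions a second time. The sketch below is an assumption about the shape of `supported-extensions.ts`: the `EXT_MAP` entries shown are illustrative, `isSupported` is a hypothetical helper, and a real module would export these names.

```typescript
// Sketch of a shared supported-extensions module (contents are assumptions).
// Illustrative subset of the canonical extension -> language map.
const EXT_MAP: Readonly<Record<string, string>> = {
  ".tex": "latex",
  ".sty": "latex",
  ".cls": "latex",
  ".ltx": "latex",
  ".latex": "latex",
  ".css": "css",
  ".lua": "lua",
};

// Derived from EXT_MAP, so adding a parser there updates discovery as well.
const SUPPORTED_EXTENSIONS: ReadonlySet<string> = new Set(Object.keys(EXT_MAP));

// Hypothetical helper that ingest, watch, and stale detection could all call.
function isSupported(fileName: string): boolean {
  const dot = fileName.lastIndexOf(".");
  return dot >= 0 && SUPPORTED_EXTENSIONS.has(fileName.slice(dot).toLowerCase());
}
```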
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
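Issue 1 from the review notes above can be reproduced in a few lines. The sketch below assumes a simplified event stream and a resolver that treats an edge source as resolvable only when exactly one node carries that bare name; `ScanEvent`, `ContainsEdge`, `containsEdges`, and `danglingSources` are hypothetical names, not patch-builder's API.

```typescript
// Hypothetical model of the scanner output and CONTAINS edges.
type ScanEvent =
  | { kind: "section"; title: string }
  | { kind: "begin"; env: string }
  | { kind: "end" }
  | { kind: "label"; key: string };

interface ContainsEdge { source: string; target: string }

// Emit CONTAINS edges, parenting labels either to the innermost
// environment (old behaviour) or to the nearest enclosing section (fix).
function containsEdges(events: ScanEvent[], policy: "environment" | "section"): ContainsEdge[] {
  const edges: ContainsEdge[] = [];
  let section: string | undefined;
  const envs: string[] = [];
  for (const e of events) {
    if (e.kind === "section") {
      section = e.title;
    } else if (e.kind === "begin") {
      // Environments stay recorded as members of their section.
      if (section !== undefined) edges.push({ source: section, target: e.env });
      envs.push(e.env);
    } else if (e.kind === "end") {
      envs.pop();
    } else {
      const env = envs[envs.length - 1];
      const source = policy === "environment" && env !== undefined ? env : section;
      if (source !== undefined) edges.push({ source, target: e.key });
    }
  }
  return edges;
}

// Count edges whose source cannot be resolved by bare name alone
// (zero or several nodes share that name).
function danglingSources(edges: ContainsEdge[], nodeNames: string[]): number {
  return edges.filter(e => nodeNames.filter(n => n === e.source).length !== 1).length;
}
```

With two sections that each contain a `figure` holding a label, the environment policy produces two edges sourced from the ambiguous name `figure`, while the section policy produces none.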
Closes #289.
Summary
Adds a TeX/LaTeX parser to `core-ingestion` so `.tex`, `.sty`, `.cls`, `.ltx`, and `.latex` files are analyzed instead of ignored.

The only published npm `tree-sitter-latex` is an abandoned `0.0.0` nan-based placeholder that does not build on the current tree-sitter ABI, and the latex-lsp grammar ships no Node binding. Following the issue's documented fallback, this is a hand-rolled single-pass O(n) scanner, which is also the more robust option for TeX, whose custom-macro arities a static grammar cannot know (real grammars emit ERROR nodes on ordinary documents). It mirrors the existing hand-rolled parsers (Markdown/YAML/TOML/…), degrades gracefully on malformed input, and is deterministic.

Extracted
- Sectioning: `\part` → `\subparagraph` as nested `CONTAINS`; environments (`\begin..\end`) chained `file → section → environment → content`
- Definitions: `\newcommand`/`\renewcommand`/`\providecommand`/`\DeclareRobustCommand`, `\def`/`\let`, `\newenvironment`, `\newtheorem`; `\label`/`\bibitem` anchors
- IMPORTS: `\usepackage`/`\RequirePackage`/`\documentclass` (packages); `\input`/`\include`/`\subfile`/`\import` (resolved to `.tex`); `\bibliography`/`\addbibresource` (`.bib`)
- REFERENCES: `\ref`/`\eqref`/`\cref`/… → matching `\label`; `\cite`/`\citep`/`\citet`/… → bib keys
- Skips `verbatim`/`lstlisting`/`minted` bodies, strips comments, ignores the dynamic `\csname…\endcsname` name primitive

Drive-by fix
`ix-cli`'s discovery allowlist (`SUPPORTED_EXTENSIONS`) had drifted out of sync with the canonical `EXT_MAP` and was missing every recently added parser (css/lua/zig/html/xml/hcl/bash/haskell), so those files were never walked for `ix map`. Resynced it (and added the LaTeX extensions). `ix text --language latex` + language inference wired up too.

Testing
- Tests (`queries.latex.test.ts`) covering every construct + malformed/determinism/comment/verbatim edge cases.
- Validated on 3 real repos (a PhD thesis with `.cls` + multi-file `\include`, the beamer package, a LaTeX book): 203 files, 0 crashes, 0 nondeterministic (byte-compared a double-parse of every file). Spot-checked quality: includes resolve to `.tex`, `\ref` lands on the matching `\label`, beamer yields 2320 macros / 295 sections / 223 imports.
- `core-ingestion` and `ix-cli` typecheck clean.

Merge-requirement checklist (#289)
- … `.sty` / book with `\part`/`\chapter`)
- … `.tex` dispatch unchanged; only additive)
- `\newcommand`/`\def`/`\newenvironment` definitions as entities
- `\label` anchors; `\ref`/`\eqref`/`\cref` → label REFERENCES
- `\usepackage`/`\RequirePackage`/`\documentclass` IMPORTS
- `\input`/`\include`/`\subfile` resolved to `.tex`
- `\cite`/`\citep`/`\citet` REFERENCES to bib keys
- `CONTAINS` chaining
- `ix text` returns `language: latex`
- `ix contains` returns members for a `.tex` file (CONTAINS edges emitted)
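As an illustration of the IMPORTS rows above, the following sketch maps a command name and its brace argument to import targets. It is a simplification under stated assumptions: `ImportEdge`, `withExt`, and `importsFor` are hypothetical names, the two-argument `\import{dir}{file}` form is left out, and a default extension is appended only when the last path segment contains no dot.

```typescript
// Hypothetical shape for an import edge; not the PR's actual type.
interface ImportEdge { kind: "package" | "file" | "bib"; target: string }

// Append the default extension only when the last path segment has none.
function withExt(path: string, ext: string): string {
  const base = path.slice(path.lastIndexOf("/") + 1);
  return base.includes(".") ? path : path + ext;
}

function importsFor(command: string, arg: string): ImportEdge[] {
  // Several commands accept a comma-separated list in one argument.
  const parts = arg.split(",").map(s => s.trim()).filter(s => s.length > 0);
  switch (command) {
    case "usepackage":
    case "RequirePackage":
    case "documentclass":
      return parts.map((p): ImportEdge => ({ kind: "package", target: p }));
    case "input":
    case "include":
    case "subfile": {
      const path = arg.trim();
      return path === "" ? [] : [{ kind: "file", target: withExt(path, ".tex") }];
    }
    case "bibliography":
      return parts.map((p): ImportEdge => ({ kind: "bib", target: withExt(p, ".bib") }));
    case "addbibresource": {
      // biblatex expects the file name with its extension, so keep it as written.
      const path = arg.trim();
      return path === "" ? [] : [{ kind: "bib", target: path }];
    }
    default:
      return []; // not an import command
  }
}
```

For example, `importsFor("input", "chapters/intro")` yields a single `file` edge targeting `chapters/intro.tex`, and `importsFor("bibliography", "refs, more")` yields `refs.bib` and `more.bib`.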