
Add IR to compiler; validate in smart constructors#196

Merged: hhugo merged 11 commits into master from ir on Apr 13, 2026
Collaborator

@hhugo hhugo commented Apr 7, 2026

Summary

  • Introduce Ir.t, a pure intermediate representation for sedlex patterns that captures regexp structure and named captures before tag allocation. Lives in src/compiler/ir.ml with no ppxlib dependency.
  • Add Sedlex.compile_ir : Ir.t array -> compiled_ir to the compiler library. It handles tag allocation (Start_plus/End_minus optimizations), discriminator insertion for or-patterns, and DFA construction.
  • The PPX becomes a thin translator from OCaml pattern AST to Ir.t via ir_of_pattern, delegating all optimization decisions to the compiler.
  • Flatten Alt from binary to n-ary (Alt of t list), simplifying discriminator assignment to a single pass and removing the reusable_cell heuristic.
  • Move structural validation into IR smart constructors returning (t, string) result: capture checks shadowing, alt checks name consistency across branches, star/plus/rep reject inner captures. The PPX uses a local unwrap helper to bridge results to ppxlib error reporting.
  • Use SSet instead of string list for capture_names.
  • Improve error messages: bare operators (Star, Plus, Opt, etc.) without arguments now produce specific errors instead of falling through to "not a valid regexp"; unknown constructors produce "unknown sedlex operator X".
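
The smart-constructor approach described above can be sketched as follows. This is a minimal illustration, not the contents of src/compiler/ir.ml: the constructor set, the `capture_names` helper, and the error strings are assumptions made for the example. Only the overall shape (constructors returning `(t, string) result`, an n-ary `Alt`, and an `SSet` of capture names) comes from the summary.

```ocaml
(* Hypothetical sketch: names and error strings are assumptions. *)
module SSet = Set.Make (String)

type t =
  | Chars of char list (* stand-in for a character set *)
  | Seq of t * t
  | Alt of t list (* n-ary alternation *)
  | Star of t
  | Capture of string * t

(* Names bound anywhere inside a pattern. *)
let rec capture_names = function
  | Chars _ -> SSet.empty
  | Seq (a, b) -> SSet.union (capture_names a) (capture_names b)
  | Alt branches ->
      List.fold_left
        (fun acc b -> SSet.union acc (capture_names b))
        SSet.empty branches
  | Star r -> capture_names r
  | Capture (name, r) -> SSet.add name (capture_names r)

(* capture rejects a name that is already bound inside the inner pattern. *)
let capture name r : (t, string) result =
  if SSet.mem name (capture_names r) then
    Error (Printf.sprintf "capture %s shadows an inner binding" name)
  else Ok (Capture (name, r))

(* star rejects inner captures. *)
let star r : (t, string) result =
  if SSet.is_empty (capture_names r) then Ok (Star r)
  else Error "captures are not allowed inside Star"

(* alt requires every branch to bind the same set of names. *)
let alt branches : (t, string) result =
  match branches with
  | [] -> Error "empty alternation"
  | first :: rest ->
      let names = capture_names first in
      if List.for_all (fun b -> SSet.equal names (capture_names b)) rest then
        Ok (Alt branches)
      else Error "or-pattern branches bind different names"
```

Because each constructor returns a `result`, an invalid tree cannot be built through this interface, and a caller without ppxlib (a unit test, for instance) can match on `Error` directly.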

Motivation

The PPX previously interleaved pattern parsing with tag allocation decisions (bind vs bind_start_only vs bind_end_only), discriminator handling, and fixed_length computation. This made it impossible to test the compilation pipeline without ppxlib, and spread compiler concerns across two packages.

With the IR, tests can construct Ir.t values directly and exercise the full compile_ir path. Future optimizations (tag delay, register allocation) can be added to the compiler without touching the PPX.

Test plan

  • dune build succeeds
  • dune runtest passes: all existing expect tests produce identical output (modulo discriminator cell renumbering caused by the flat Alt change)
  • Error messages preserved for all validation cases (captures inside Star/Plus/Rep/Opt/Compl/Sub/Intersect, mismatched or-pattern names)

🤖 Generated with Claude Code

Collaborator Author

hhugo commented Apr 7, 2026

@toots, I think you'll like this PR as well. It will help with #194 quite a bit.

hhugo and others added 5 commits April 8, 2026 00:00
Introduce Ir.t, a pure intermediate representation for sedlex patterns
that captures regexp structure and named captures before tag allocation.
The compiler's new compile_ir entry point handles tag allocation
(Start_plus/End_minus optimizations), discriminator insertion for
or-patterns, and DFA construction. The PPX becomes a thin translator
from OCaml pattern AST to Ir.t.

Reject capture that shadows an inner binding of the same name

Validate that `(... as x) as x` is rejected — a Capture node must not
bind a name that already appears inside its inner pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change Alt from binary (Alt of t * t) to n-ary (Alt of t list) with
smart constructor that flattens nested Alts. This lets the compiler see
all branches at once and assign discriminators in a single pass,
removing the reusable_cell heuristic that was needed to handle OCaml's
left-nested desugaring of or-patterns.

Branches with identical bindings now share discriminator values, which
can reduce the number of distinct values needed.
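
To illustrate why flattening helps, here is a small sketch of a flattening constructor. OCaml parses `a | b | c` as `(a | b) | c`, so a binary `Alt` yields a left-nested tree; an n-ary `Alt` with flattening exposes all branches in one list. The type and function names below are assumptions for the example, and discriminator assignment itself is not shown.

```ocaml
(* Hypothetical sketch: a reduced IR with only what the example needs. *)
type t = Chars of string | Alt of t list

(* Collect the leaves of arbitrarily nested alternations, left to right. *)
let rec flatten_branches = function
  | Alt bs -> List.concat_map flatten_branches bs
  | r -> [ r ]

(* Smart constructor: the resulting Alt never contains a direct Alt child. *)
let alt branches = Alt (List.concat_map flatten_branches branches)

(* The left-nested shape produced by desugaring `a | b | c`. *)
let left_nested = alt [ alt [ Chars "a"; Chars "b" ]; Chars "c" ]
```

Here `left_nested` is `Alt [Chars "a"; Chars "b"; Chars "c"]`, so a later pass can walk the three branches in a single iteration.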

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move structural validation from the PPX into IR smart constructors
that return (t, string) result: capture checks shadowing, alt checks
name consistency across branches, star/plus/rep reject inner captures.
Add reject_captures for contexts that forbid captures (Opt, Compl,
Sub, Intersect).

The PPX uses a local unwrap helper to bridge IR results to ppxlib's
exception-based error reporting. Remove the now-redundant validate
function in favour of a debug-only check_invariant assertion.
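
A bridge of this kind might look like the sketch below. The `Located_error` exception and the string location are stand-ins invented so the example runs without ppxlib; the actual helper in the PPX presumably raises through ppxlib's own error mechanism with a real location value.

```ocaml
(* Stand-in for a located ppxlib error: (location, message).
   Both the exception and the string location are assumptions. *)
exception Located_error of string * string

(* Turn a result-returning IR constructor call into a value or a raise. *)
let unwrap ~loc = function
  | Ok v -> v
  | Error msg -> raise (Located_error (loc, msg))
```

Keeping the exception at the PPX boundary leaves the IR library free of any dependency on the error-reporting mechanism.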

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bare operators without argument (Star, Plus, Opt, Utf8, Latin1, Ascii)
  now produce "the X operator requires an argument" instead of falling
  through to "this pattern is not a valid regexp".
- Unknown constructors (e.g. Some) produce "unknown sedlex operator X".
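
The two message shapes can be sketched as a small classification function. The function name, the `~has_arg` parameter, and the restriction to the six operators listed above are assumptions for illustration; the real PPX matches on the pattern AST and knows more operators than this list.

```ocaml
(* Operators named in the commit message; the real set is larger. *)
let operators_requiring_argument =
  [ "Star"; "Plus"; "Opt"; "Utf8"; "Latin1"; "Ascii" ]

(* Returns an error message, or None when the constructor use is acceptable. *)
let constructor_error ~has_arg name =
  if not (List.mem name operators_requiring_argument) then
    Some (Printf.sprintf "unknown sedlex operator %s" name)
  else if has_arg then None
  else Some (Printf.sprintf "the %s operator requires an argument" name)
```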

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hhugo hhugo changed the title from "Add IR to compiler; move tag allocation out of PPX" to "Add IR to compiler; validate in smart constructors" Apr 7, 2026
hhugo and others added 6 commits April 13, 2026 22:14
Validate capture names before flattening in alt (simpler logic).
Rewrite check_invariant to return SSet.t instead of threading
~inside_rep, making assertions more direct. Extract cp helper
in pretty-printer and use it consistently for character ranges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the 0 <= n <= m check into Ir.rep so it is enforced regardless
of call site. Keep the explicit check in the PPX for better error
locations.
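
As a rough sketch of the bounds check, assuming a `Rep` node that carries its minimum and maximum counts (the type, argument order, and error text here are hypothetical):

```ocaml
(* Hypothetical reduced IR: Rep (r, n, m) repeats r between n and m times. *)
type t = Chars of string | Rep of t * int * int

(* Enforce 0 <= n <= m inside the constructor, whatever the call site. *)
let rep r n m : (t, string) result =
  if 0 <= n && n <= m then Ok (Rep (r, n, m))
  else Error (Printf.sprintf "invalid repetition bounds %d..%d" n m)
```

A caller that already checked the bounds, as the PPX does for the sake of error locations, simply never sees the `Error` case.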

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the special case that skipped discriminators for capture-free
branches — add_discriminators already handles that case correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hhugo hhugo merged commit 8bfdc7c into master Apr 13, 2026
8 checks passed