Introduce Ir.t, a pure intermediate representation for sedlex patterns that captures regexp structure and named captures before tag allocation. The compiler's new compile_ir entry point handles tag allocation (Start_plus/End_minus optimizations), discriminator insertion for or-patterns, and DFA construction. The PPX becomes a thin translator from OCaml pattern AST to Ir.t.

Reject capture that shadows an inner binding of the same name

Validate that `(... as x) as x` is rejected: a Capture node must not bind a name that already appears inside its inner pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
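The shadowing rule can be illustrated with a minimal sketch. The type below is a hypothetical, cut-down stand-in for `Ir.t` (the real constructor set is not shown in this thread), and the error text is invented for illustration.

```ocaml
module SSet = Set.Make (String)

(* Hypothetical, simplified IR; the real Ir.t has more constructors. *)
type t =
  | Chars of char list
  | Seq of t * t
  | Capture of string * t

(* Collect every name bound anywhere inside a pattern. *)
let rec capture_names = function
  | Chars _ -> SSet.empty
  | Seq (a, b) -> SSet.union (capture_names a) (capture_names b)
  | Capture (name, inner) -> SSet.add name (capture_names inner)

(* Refuse to bind [name] if [inner] already binds it,
   which is the `(... as x) as x` case. *)
let capture name inner =
  if SSet.mem name (capture_names inner) then
    Error (Printf.sprintf "capture %S shadows an inner binding" name)
  else Ok (Capture (name, inner))
```

Binding a fresh name around an existing capture still succeeds; only the reuse of a name already bound inside the wrapped pattern is rejected.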
Change Alt from binary (Alt of t * t) to n-ary (Alt of t list) with a smart constructor that flattens nested Alts. This lets the compiler see all branches at once and assign discriminators in a single pass, removing the reusable_cell heuristic that was needed to handle OCaml's left-nested desugaring of or-patterns. Branches with identical bindings now share discriminator values, which can reduce the number of distinct values needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
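A flattening smart constructor of this kind might look like the sketch below. The two-constructor type is an assumption made for brevity, and the sketch leaves out the name-consistency validation that the actual `alt` is described as performing.

```ocaml
(* Hypothetical, simplified IR with an n-ary alternation. *)
type t =
  | Chars of char list
  | Alt of t list

(* Expand a pattern into its alternation branches, recursing through
   nested Alts so the result contains no Alt at the top level. *)
let rec branches_of = function
  | Alt inner -> List.concat_map branches_of inner
  | other -> [ other ]

(* Smart constructor: always produces a flat Alt. *)
let alt branches = Alt (List.concat_map branches_of branches)
```

With this shape, the left-nested form `(a | b) | c` that OCaml's or-pattern desugaring produces ends up as a single three-branch node, so a later pass can look at all branches together.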
Move structural validation from the PPX into IR smart constructors that return (t, string) result: capture checks shadowing, alt checks name consistency across branches, star/plus/rep reject inner captures. Add reject_captures for contexts that forbid captures (Opt, Compl, Sub, Intersect). The PPX uses a local unwrap helper to bridge IR results to ppxlib's exception-based error reporting. Remove the now-redundant validate function in favour of a debug-only check_invariant assertion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
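The bridge between result-returning constructors and exception-based reporting can be sketched as follows. Everything here is illustrative: the type is a reduced stand-in for `Ir.t`, `Located_error` is a hypothetical substitute for ppxlib's located errors (so the example runs without ppxlib), and locations are plain strings.

```ocaml
(* Hypothetical, simplified IR. *)
type t =
  | Chars of char list
  | Capture of string * t
  | Star of t

let rec has_capture = function
  | Chars _ -> false
  | Capture _ -> true
  | Star inner -> has_capture inner

(* Smart constructor: repetition may not contain captures. *)
let star inner =
  if has_capture inner then Error "captures are not allowed under star"
  else Ok (Star inner)

(* Stand-in for a located PPX error: (location, message). *)
exception Located_error of string * string

(* Turn an IR result into a value, raising at the given location on error. *)
let unwrap ~loc = function
  | Ok ir -> ir
  | Error msg -> raise (Located_error (loc, msg))
```

Keeping the constructors pure and attaching the location only in `unwrap` means the IR module needs no notion of source positions, which fits the stated goal of having no ppxlib dependency.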
- Bare operators without argument (Star, Plus, Opt, Utf8, Latin1, Ascii) now produce "the X operator requires an argument" instead of falling through to "this pattern is not a valid regexp".
- Unknown constructors (e.g. Some) produce "unknown sedlex operator X".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validate capture names before flattening in alt (simpler logic). Rewrite check_invariant to return SSet.t instead of threading ~inside_rep, making assertions more direct. Extract cp helper in pretty-printer and use it consistently for character ranges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
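One way to read the `check_invariant` change: instead of passing a flag down the tree, the function returns the set of names bound in each subtree and asserts on that set at the node where a rule applies. The sketch below assumes a reduced constructor set and only the three rules mentioned in this thread; it is not the project's actual function.

```ocaml
module SSet = Set.Make (String)

(* Hypothetical, simplified IR. *)
type t =
  | Chars of char list
  | Seq of t * t
  | Alt of t list
  | Star of t
  | Capture of string * t

(* Return the names bound in the subtree, asserting invariants bottom-up. *)
let rec check_invariant = function
  | Chars _ -> SSet.empty
  | Seq (a, b) -> SSet.union (check_invariant a) (check_invariant b)
  | Alt branches -> (
      match List.map check_invariant branches with
      | [] -> SSet.empty
      | first :: rest ->
          (* every branch must bind the same names *)
          assert (List.for_all (SSet.equal first) rest);
          first)
  | Star inner ->
      let names = check_invariant inner in
      (* no captures under repetition *)
      assert (SSet.is_empty names);
      names
  | Capture (name, inner) ->
      let names = check_invariant inner in
      (* no shadowing of an inner binding *)
      assert (not (SSet.mem name names));
      SSet.add name names
```

Because the checks are plain `assert` expressions, they disappear when compiling with `-noassert`, which matches the debug-only role described for this function.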
Move the 0 <= n <= m check into Ir.rep so it is enforced regardless of call site. Keep the explicit check in the PPX for better error locations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
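As a rough illustration of enforcing the bounds in the constructor itself, the sketch below uses a hypothetical two-constructor type and an invented error message. It omits the inner-capture rejection that `rep` is also described as performing.

```ocaml
(* Hypothetical, simplified IR. *)
type t =
  | Chars of char list
  | Rep of t * int * int

(* Smart constructor: only build a Rep when 0 <= n <= m holds. *)
let rep inner n m =
  if 0 <= n && n <= m then Ok (Rep (inner, n, m))
  else Error (Printf.sprintf "invalid repetition bounds %d..%d" n m)
```

A caller such as the PPX can still check the same condition earlier to report a more precise location; the constructor check then acts as a backstop for every other call site.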
Remove the special case that skipped discriminators for capture-free branches — add_discriminators already handles that case correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Introduce `Ir.t`, a pure intermediate representation for sedlex patterns that captures regexp structure and named captures before tag allocation. Lives in `src/compiler/ir.ml` with no ppxlib dependency.
- Add `Sedlex.compile_ir : Ir.t array -> compiled_ir` to the compiler library. It handles tag allocation (Start_plus/End_minus optimizations), discriminator insertion for or-patterns, and DFA construction.
- Reduce the PPX to a thin translator from OCaml patterns to `Ir.t` via `ir_of_pattern`, delegating all optimization decisions to the compiler.
- Change `Alt` from binary to n-ary (`Alt of t list`), simplifying discriminator assignment to a single pass and removing the `reusable_cell` heuristic.
- Move structural validation into IR smart constructors that return `(t, string) result`: `capture` checks shadowing, `alt` checks name consistency across branches, `star`/`plus`/`rep` reject inner captures. The PPX uses a local `unwrap` helper to bridge results to ppxlib error reporting.
- Use `SSet` instead of `string list` for `capture_names`.

Motivation
The PPX previously interleaved pattern parsing with tag allocation decisions (`bind` vs `bind_start_only` vs `bind_end_only`), discriminator handling, and `fixed_length` computation. This made it impossible to test the compilation pipeline without ppxlib, and spread compiler concerns across two packages.

With the IR, tests can construct `Ir.t` values directly and exercise the full `compile_ir` path. Future optimizations (tag delay, register allocation) can be added to the compiler without touching the PPX.

Test plan
- `dune build` succeeds
- `dune runtest` passes: all existing expect tests produce identical output (modulo disc cell renumbering from the flat Alt change)

🤖 Generated with Claude Code