Defer tag writes past fixed-length neighbors#6

Closed
hhugo wants to merge 24 commits into master from prefer-end-tag-element-length

@hhugo
Owner

@hhugo hhugo commented Apr 1, 2026

Summary

  • Prefer the end tag over the start tag when optimizing fixed-length elements; this delays the tag write until after the element
  • Allocate boundary tags for fixed-length tuple elements without a right-position anchor, so that as-binding tags fire as late as possible in the automaton
  • Add a dead tag elimination pass (Sedlex.optimize) that strips unused boundary tags and remaps live ones to a dense range
  • Inner tuples communicate boundary anchors to enclosing aliases via aux's return triple
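
The "strip unused tags and remap live ones to a dense range" step can be pictured with the small sketch below. It is only an illustration under assumptions: tags are modelled as plain integers, tag writes as a flat list, and the names `dense_remap` and `strip_and_remap` are hypothetical. None of this is the actual Sedlex.optimize code.

```ocaml
(* Hypothetical sketch, not the real Sedlex.optimize implementation. *)
module IntSet = Set.Make (Int)
module IntMap = Map.Make (Int)

(* Map each live tag to its rank among the live tags: 0, 1, 2, ...
   IntSet.fold visits elements in increasing order, so the relative
   order of the surviving tags is preserved. *)
let dense_remap (live : IntSet.t) : int IntMap.t =
  let _, remap =
    IntSet.fold
      (fun tag (next, m) -> (next + 1, IntMap.add tag next m))
      live (0, IntMap.empty)
  in
  remap

(* Drop writes to dead tags and renumber writes to live tags. *)
let strip_and_remap (live : IntSet.t) (writes : int list) : int list =
  let remap = dense_remap live in
  List.filter_map (fun tag -> IntMap.find_opt tag remap) writes
```

With live tags {1, 4, 6}, the writes `[0; 1; 4; 5; 6; 1]` become `[0; 1; 2; 0]`: the writes to tags 0 and 5 disappear and three memory cells suffice.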

Test plan

  • Existing expect tests pass (codegen, realistic, basic)
  • New test: deferred start tag past inner prefix (("0x", Plus hexa) as x)
  • Verify dead boundary tags are eliminated (mem_cells unchanged for patterns without aliases)

🤖 Generated with Claude Code

toots and others added 24 commits June 5, 2025 07:58
Add two example calculators showing how to bridge Sedlexing.lexbuf
with ocamlyacc/menhir parsers, and document the pattern in the README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The upper bound of the surrogate rejection range was 0xdf00 instead of
0xdfff, which would have allowed U+DF01..U+DFFF through. In practice
the bug was masked by the local Uchar.of_int wrapper, but fix it for
correctness. Add comments explaining why only check_three needs the
surrogate check, and add an expect test for surrogate rejection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
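
As a quick illustration of the off-by-range bug described above, here is a minimal sketch comparing the two upper bounds. The function names are hypothetical and this is not the decoder code from the commit; it only relies on the fact that surrogates occupy U+D800..U+DFFF inclusive and on the standard library's `Uchar.is_valid`.

```ocaml
(* Hypothetical helpers for illustration; not the code touched by the commit. *)

(* Correct check: surrogates are U+D800..U+DFFF, both ends inclusive. *)
let is_surrogate (cp : int) : bool = cp >= 0xd800 && cp <= 0xdfff

(* The buggy bound: anything in U+DF01..U+DFFF slips through. *)
let buggy_is_surrogate (cp : int) : bool = cp >= 0xd800 && cp <= 0xdf00

(* Uchar.is_valid rejects surrogates too, which is consistent with the
   commit's remark that a Uchar-based wrapper masked the bug. *)
let masked_by_uchar (cp : int) : bool =
  (not (buggy_is_surrogate cp)) && not (Uchar.is_valid cp)
```

For example, `masked_by_uchar 0xdfff` is true: the buggy range check accepts the code point, but `Uchar.is_valid` still refuses it.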
…nity#176)

* Support nested let..in for [%sedlex.regexp?] definitions (ocaml-community#41)

Allow users to define named regexps using nested let statements, e.g.:
  let int_lit =
    let digit = [%sedlex.regexp? '0'..'9'] in
    [%sedlex.regexp? Plus digit]

Add eval_regexp_expr method that recursively evaluates let..in chains
of regexp definitions, used by both the expression handler and
structure_with_regexps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add comment to ast match

* Update documentation

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The default branch in a match%sedlex is not a regexp — it fires when
no rule matches, so zero characters are consumed and the lexeme is "".
To catch unexpected characters, use `any` instead.

Closes ocaml-community#51

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-community#181)

* Document regexp operator precedence (fixes ocaml-community#35)

Since sedlex regexps are OCaml patterns, they follow OCaml's pattern
precedence: | (lowest) < , < constructor application (highest).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
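
The precedence rule can be checked with ordinary OCaml patterns, without the sedlex ppx. The sketch below uses `Some` as a stand-in constructor and a hypothetical `classify` function; it only shows how OCaml groups the pattern, not how sedlex compiles it.

```ocaml
(* Plain OCaml, no ppx: the pattern below groups as
   ((Some 1), 2) | ((Some 3), 4)
   because constructor application binds tighter than ',',
   which in turn binds tighter than '|'. *)
let classify (p : int option * int) : string =
  match p with
  | Some 1, 2 | Some 3, 4 -> "matched"
  | _ -> "other"
```

By the same rule, a sedlex regexp written `Plus 'a', 'b' | 'c'` would read as `((Plus 'a'), 'b') | 'c'`; parentheses are needed for any other grouping.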

* Doc: add new sub sections

* cleanup

* cleanup

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
emacs/sedlex-dot.el provides two interactive commands to render/remove
DOT graph overlays directly in test buffers. render_dots.sh offers a
CLI alternative for batch-rendering DOT graphs as SVG. Both are
documented in HACKING.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduce memory consumption for named patterns under an or-pattern

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Delay tag writes as late as possible: when a fixed-length element
needs only one tag, use the end tag instead of the start tag. This
reduces redundant tag operations in loops (e.g., self-loop on 'a'
no longer writes the tag on every iteration).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
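
To make the saving concrete, the toy model below compares the two strategies on inputs of the form a*b, where the final 'b' is the fixed-length element being bound. It is a deliberately simplified model with hypothetical function names, not the automaton sedlex generates, and it assumes the input already matches a*b.

```ocaml
(* Toy model only; assumes the input matches a*b. *)

(* Start-tag strategy: the tag is rewritten at every position, since any
   position might turn out to be where the bound element starts.
   Returns (start of the element, number of tag writes). *)
let with_start_tag (s : string) : int * int =
  let writes = ref 0 and tag = ref 0 in
  String.iteri (fun i _ -> tag := i; incr writes) s;
  (!tag, !writes)

(* End-tag strategy: a single write once the element has been consumed;
   the start is recovered by subtracting the known length. *)
let with_end_tag ~(len : int) (s : string) : int * int =
  let end_tag = String.length s in
  (end_tag - len, 1)
```

On `"aaab"` both strategies locate the element at offset 3, but the first performs four tag writes and the second performs one.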
Allocate boundary tags for fixed-length tuple elements that lack a
right-position anchor, so that as-binding tags fire as late as
possible in the automaton.

During the rights computation (right-to-left pass), when retreat
breaks at a variable-length element but the current element has a
fixed length, a boundary tag is allocated at the element's end.
This tag becomes a concrete anchor: elements further left can
compute their positions via Tag{tag; offset}.  Dead boundary tags
(unreferenced by any as-binding) are eliminated by a new
Sedlex.optimize pass that strips unused tags and remaps live ones
to a dense range.

Inner tuples communicate boundary anchors to an enclosing alias
via the third element of aux's return triple: (start_anchor,
end_anchor).  The alias picks whichever expression fires latest:
- Start_plus/End_minus always win (no tag needed).
- For the start boundary: inner anchor > outer left context.
- For the end boundary: outer right context > inner anchor.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
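
The selection rules listed above can be sketched as a small priority function. The constructor names `Start_plus`, `End_minus` and `Tag` come from the commit message; their payloads, the `pos` type itself, and the functions `pick`, `pick_start` and `pick_end` are assumptions made for illustration and may not match the real definitions.

```ocaml
(* Hypothetical position expressions; payloads are assumptions. *)
type pos =
  | Start_plus of int                     (* offset from the lexeme start *)
  | End_minus of int                      (* offset back from the lexeme end *)
  | Tag of { tag : int; offset : int }    (* offset relative to a tag value *)

(* Start_plus / End_minus need no tag at all. *)
let is_tag_free = function
  | Start_plus _ | End_minus _ -> true
  | Tag _ -> false

(* Candidates are listed in priority order. A tag-free candidate always
   wins; otherwise the first available candidate is used. *)
let pick (candidates : pos option list) : pos option =
  let available = List.filter_map Fun.id candidates in
  match List.find_opt is_tag_free available with
  | Some p -> Some p
  | None -> (match available with [] -> None | p :: _ -> Some p)

(* Start boundary: inner anchor beats outer left context. *)
let pick_start ~inner ~outer_left = pick [ inner; outer_left ]

(* End boundary: outer right context beats inner anchor. *)
let pick_end ~inner ~outer_right = pick [ outer_right; inner ]
```

With two tag-based candidates, `pick_start` returns the inner anchor and `pick_end` returns the outer right context; if either candidate is a `Start_plus` or `End_minus`, that one is chosen regardless of position in the list.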
@hhugo hhugo closed this Apr 1, 2026
2 participants