Skip to content

Add basic support for named capture group#177

Merged
toots merged 5 commits into
ocaml-community:masterfrom
hhugo:name-minimal
Mar 25, 2026
Merged

Add basic support for named capture group#177
toots merged 5 commits into
ocaml-community:masterfrom
hhugo:name-minimal

Conversation

@hhugo
Copy link
Copy Markdown
Collaborator

@hhugo hhugo commented Feb 10, 2026

Summary

Based on top of #174
Add support for as bindings in sedlex patterns, allowing users to capture sub-matches by name:

match%sedlex buf with
| ("hello", (Star any as rest)) -> print_string (Utf8.of_submatch rest)
| _ -> ()

This uses a tagged DFA approach (Laurikari-style) that records sub-match positions in a single forward pass, with no backtracking penalty.

What's supported

  • Basic captures: (p as x) binds x to the text matched by p
  • Multiple captures: (p1 as x, p2 as y) in a sequence
  • Or-patterns: (p1 as x) | (p2 as x) — both branches must bind the same names; a discriminator tag determines which branch matched
  • Nested match%sedlex: inner and outer lexers maintain independent tag state
  • Zero overhead when unused: rules without as generate identical code to before

What's rejected (with clear error messages)

  • as inside repetition operators (Star, Plus, Rep, Opt)
  • as inside set operators (Compl, Sub, Intersect)
  • as in named regexp definitions (let%sedlex)
  • Or-patterns where branches bind different names

Implementation

  • NFA: nodes carry optional tags; Sedlex.bind wraps a regexp with start/end tag epsilon nodes
  • DFA: epsilon closure collects tags; transitions carry tag action lists
  • Runtime: new __private__mem arrays on the lexbuf store tag positions, saved/restored on mark/backtrack, adjusted on refill
  • PPX: regexp_of_pattern returns tag info alongside the regexp; gen_binding_code emits let bindings that extract submatch from tag positions
  • New submatch type in the public API for structured access to sub-match positions

Testing

  • Runtime tests in test/basic.ml covering captures, or-patterns, nested match, and edge cases
  • Codegen baseline tests in test/codegen/ tracking generated code and automata structure
  • DOT graph output in test PPX for visual inspection of tagged automata

This is a minimal implementation — no tag optimizations are applied yet. The codegen tests document current (unoptimized) baselines to track future improvements (tag coalescing, offset propagation, etc.).

Next

If accepted, 4 PRs will follow to improve performances:
See https://github.com/hhugo/sedlex/pulls

@hhugo
Copy link
Copy Markdown
Collaborator Author

hhugo commented Feb 10, 2026

cc @toots

@pmetzger
Copy link
Copy Markdown
Member

Let me know what you would like me to do with these things.

@hhugo hhugo force-pushed the name-minimal branch 2 times, most recently from 14d80de to b0c1d9e Compare March 11, 2026 16:12
@hhugo
Copy link
Copy Markdown
Collaborator Author

hhugo commented Mar 11, 2026

Let me know what you would like me to do with these things.

This PR is ready to review. It would be nice to have another human review on it.

@pmetzger
Copy link
Copy Markdown
Member

@toots @Drup Thoughts?

@hhugo
Copy link
Copy Markdown
Collaborator Author

hhugo commented Mar 19, 2026

@toots, do you think you could review this or should we look for some other reviewer ?

@toots
Copy link
Copy Markdown
Member

toots commented Mar 24, 2026

@toots, do you think you could review this or should we look for some other reviewer ?

The feature looks very cool and I want to help but I'm very unfamiliar with that part of the codebase.

Therefore, my feedback will be about long term maintenance:

  • Is it possible to document the implementation better?
  • Is it possible to organize the implementation so that different parts are isolated and easy to reason about etc.

Maybe at first you could see if you can document the implementation? I gather that you're using a coding assistant. Those can be useful to do the base work and let your review using your own understanding of it to make sure that it is correct?

hhugo and others added 2 commits March 24, 2026 10:39
- Add ppx_sedlex.mli with minimal public surface
- Replace table_counter/partition_counter refs with Hashtbl.length
- Expose reset_state instead of raw partitions/tables hashtables
- Bake builtin_regexps and Fun.id into handle_sedlex_match
- Comment out unused extensions value
- Remove StringMap, builtin_regexps, regexp_of_pattern from interface

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hhugo hhugo force-pushed the name-minimal branch 2 times, most recently from e7ac0a6 to 79f3579 Compare March 24, 2026 13:27
Copy link
Copy Markdown
Member

@toots toots left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the documentation. This seems pretty detailed and if you're confident it is accurate, that's a great addition!

My next and hopefully last feedback: I see the tests checking a lot of the happy part.

Are we testing the error paths? Are errors and limitations of the implementations clear, expected, tested and documented for the user to understanding?

Thanks!

Comment thread src/syntax/ppx_sedlex.ml
Comment thread src/syntax/ppx_sedlex.ml
hhugo and others added 2 commits March 24, 2026 16:41
Add [%compile_error] test extension that applies the sedlex mapper to
an expression, catches errors, and prints them with OCaml's caret
display (line numbers stripped for stability). Expose map_expression
in ppx_sedlex for this purpose.

27 expect tests in test/codegen/test_errors.ml covering every error
path in ppx_sedlex.ml: as-binding restrictions, operator misuse,
malformed strings, invalid patterns, match structure, and regexp
definition errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add section covering `as` binding syntax, submatch extraction
functions, or-pattern support, and operator restrictions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hhugo
Copy link
Copy Markdown
Collaborator Author

hhugo commented Mar 24, 2026

Are we testing the error paths? Are errors and limitations of the implementations clear, expected, tested and documented for the user to understanding?

I've added documentation in the readme.

@hhugo hhugo requested a review from toots March 24, 2026 18:52
@hhugo
Copy link
Copy Markdown
Collaborator Author

hhugo commented Mar 25, 2026

@pmetzger, this can be merged after #190

@toots
Copy link
Copy Markdown
Member

toots commented Mar 25, 2026

@pmetzger I still would like to have admin rights to this repo to enable the merge queue and auto merge.

@toots toots merged commit c2b1bcd into ocaml-community:master Mar 25, 2026
8 checks passed
This was referenced Mar 25, 2026
@pmetzger
Copy link
Copy Markdown
Member

@toots Not sure what you mean by merge queue and auto merge?

@toots
Copy link
Copy Markdown
Member

toots commented Mar 26, 2026

@pmetzger the repository can be set to protect the main branch from direct merge and auto-merge PRs when they have been accepted and pass the CI. This removes friction when reviewing. For instance, I can review some changes, approve the approach and mark for auto merge when the author has finished fixing minor CI issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants