Add basic support for named capture group #177
cc @toots
Let me know what you would like me to do with these things.
This PR is ready to review. It would be nice to have another human review on it.
@toots, do you think you could review this, or should we look for another reviewer?
The feature looks very cool and I want to help, but I'm very unfamiliar with that part of the codebase. Therefore, my feedback will be about long-term maintenance:
Maybe at first you could see if you can document the implementation? I gather that you're using a coding assistant. Those can be useful for the base work, leaving you to review the result with your own understanding of it to make sure that it is correct.
- Add ppx_sedlex.mli with minimal public surface
- Replace table_counter/partition_counter refs with Hashtbl.length
- Expose reset_state instead of raw partitions/tables hashtables
- Bake builtin_regexps and Fun.id into handle_sedlex_match
- Comment out unused extensions value
- Remove StringMap, builtin_regexps, regexp_of_pattern from interface

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
toots left a comment
Thanks for the documentation. This seems pretty detailed and if you're confident it is accurate, that's a great addition!
My next and hopefully last feedback: I see the tests checking a lot of the happy path.
Are we testing the error paths? Are the errors and limitations of the implementation clear, expected, tested and documented for the user to understand?
Thanks!
Add [%compile_error] test extension that applies the sedlex mapper to an expression, catches errors, and prints them with OCaml's caret display (line numbers stripped for stability). Expose map_expression in ppx_sedlex for this purpose.

27 expect tests in test/codegen/test_errors.ml covering every error path in ppx_sedlex.ml: as-binding restrictions, operator misuse, malformed strings, invalid patterns, match structure, and regexp definition errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add section covering `as` binding syntax, submatch extraction functions, or-pattern support, and operator restrictions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I've added documentation in the readme.
@pmetzger I still would like to have admin rights to this repo to enable the merge queue and auto merge.
@toots Not sure what you mean by merge queue and auto merge?
@pmetzger the repository can be set to protect the main branch from direct merges and to auto-merge PRs once they have been accepted and pass the CI. This removes friction when reviewing. For instance, I can review some changes, approve the approach and mark the PR for auto merge while the author finishes fixing minor CI issues.
Summary
Based on top of #174
Add support for `as` bindings in sedlex patterns, allowing users to capture sub-matches by name.
This uses a tagged DFA approach (Laurikari-style) that records sub-match positions in a single forward pass, with no backtracking penalty.
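To make the single-pass idea concrete, here is a toy sketch in plain OCaml. It is not the code this PR generates and it does not use sedlex at all: the state type, the tag numbering and `match_key_value` are all made up for illustration. It hand-codes a small automaton for a pattern shaped like `(Plus letter as key), '=', (Plus digit as value)` and writes tag positions as it crosses each capture boundary, so the sub-matches can be read back afterwards without rescanning the input.

```ocaml
(* Toy illustration only: a hand-written automaton, not sedlex output.
   Tag slots: 0 = key start, 1 = key end, 2 = value start, 3 = value end. *)

let is_letter c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_digit c = c >= '0' && c <= '9'

type state = Start | In_key | After_eq | In_value

(* Scan [s] left to right exactly once, recording a tag position each
   time the automaton enters or leaves a captured sub-pattern. *)
let match_key_value (s : string) : (string * string) option =
  let tags = Array.make 4 (-1) in
  let n = String.length s in
  let rec go state i =
    if i = n then begin
      match state with
      | In_value -> tags.(3) <- i; true
      | Start | In_key | After_eq -> false
    end else
      let c = s.[i] in
      match state with
      | Start when is_letter c -> tags.(0) <- i; go In_key (i + 1)
      | In_key when is_letter c -> go In_key (i + 1)
      | In_key when c = '=' -> tags.(1) <- i; go After_eq (i + 1)
      | After_eq when is_digit c -> tags.(2) <- i; go In_value (i + 1)
      | In_value when is_digit c -> go In_value (i + 1)
      | Start | In_key | After_eq | In_value -> false
  in
  if go Start 0 then
    (* Sub-matches are sliced out of the input from the recorded tags. *)
    let sub a b = String.sub s tags.(a) (tags.(b) - tags.(a)) in
    Some (sub 0 1, sub 2 3)
  else None
```

For example, `match_key_value "abc=123"` yields `Some ("abc", "123")`. A real tagged DFA also has to deal with ambiguity between competing paths, which this deterministic toy sidesteps.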
What's supported
- `(p as x)` binds `x` to the text matched by `p`
- `(p1 as x, p2 as y)` in a sequence
- `(p1 as x) | (p2 as x)`: both branches must bind the same names; a discriminator tag determines which branch matched
- Nested `match%sedlex`: inner and outer lexers maintain independent tag state
- Patterns without `as` generate identical code to before
What's rejected (with clear error messages)
- `as` inside repetition operators (`Star`, `Plus`, `Rep`, `Opt`)
- `as` inside set operators (`Compl`, `Sub`, `Intersect`)
- `as` in named regexp definitions (`let%sedlex`)
Implementation
- `Sedlex.bind` wraps a regexp with start/end tag epsilon nodes
- `__private__mem` arrays on the lexbuf store tag positions, saved/restored on `mark`/`backtrack`, adjusted on `refill`
- `regexp_of_pattern` returns tag info alongside the regexp; `gen_binding_code` emits `let` bindings that extract `submatch` from tag positions
- `submatch` type in the public API for structured access to sub-match positions
Testing
- `test/basic.ml` covering captures, or-patterns, nested match, and edge cases
- `test/codegen/` tracking generated code and automata structure
This is a minimal implementation: no tag optimizations are applied yet. The codegen tests document current (unoptimized) baselines to track future improvements (tag coalescing, offset propagation, etc.).
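The tag-position bookkeeping described under Implementation (saved and restored on mark/backtrack, adjusted on refill) can be pictured with a miniature model. This is a sketch under assumptions: `toy_lexbuf`, its fields and the functions below are hypothetical names chosen for illustration and do not mirror sedlex's actual lexbuf.

```ocaml
(* Hypothetical miniature of a lexbuf; names are illustrative and are
   not sedlex's actual internals. *)
type toy_lexbuf = {
  mutable pos : int;         (* current read position in the buffer *)
  mutable marked_pos : int;  (* position saved by the last [mark] *)
  mem : int array;           (* live tag positions, -1 = unset *)
  marked_mem : int array;    (* tag positions saved by the last [mark] *)
}

let create n_tags =
  { pos = 0; marked_pos = 0;
    mem = Array.make n_tags (-1);
    marked_mem = Array.make n_tags (-1) }

(* Record the current position in tag slot [i]. *)
let set_tag lb i = lb.mem.(i) <- lb.pos

(* Remember the position and tags of the last accepting state. *)
let mark lb =
  lb.marked_pos <- lb.pos;
  Array.blit lb.mem 0 lb.marked_mem 0 (Array.length lb.mem)

(* Return to the last accepting state, discarding later tag writes. *)
let backtrack lb =
  lb.pos <- lb.marked_pos;
  Array.blit lb.marked_mem 0 lb.mem 0 (Array.length lb.mem)

(* When a refill drops [shift] consumed bytes from the front of the
   buffer, every stored offset moves down by the same amount. *)
let adjust_on_refill lb shift =
  let move a = Array.iteri (fun i p -> if p >= 0 then a.(i) <- p - shift) a in
  lb.pos <- lb.pos - shift;
  lb.marked_pos <- lb.marked_pos - shift;
  move lb.mem;
  move lb.marked_mem
```

The point of saving tags together with the marked position is that a lexer which reads past the longest match and then backtracks must not keep tag writes made after the accepting state.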
Next
If accepted, 4 PRs will follow to improve performance:
See https://github.com/hhugo/sedlex/pulls