idx=3: freeze (102 records) — Savara Loan & Security Agreement, L0 sig-shape guard #76
arthrod wants to merge 1 commit into
…pth patterns + IWW L1 pin
Defect summary
==============
idx=3 (Savara Loan and Security Agreement) had no L0 title record at all.
The title "LOAN AND SECURITY AGREEMENT" was incorrectly demoted to L2
by the signature-page exploder because its short uppercase shape
matched _SIG_BLOCK_LABEL_RE. Beyond that, numbered subsections (N.M
like "2.1 Promise to Pay") were all landing at L1 because the bare
\d+\. pattern matched first; they should be L2 children of the topic-
only Article L1 headers ("LOAN AND TERMS OF PAYMENT", etc.). The IWW
operating clause also got demoted to L2 because doc2dict packed the
sig-page text into its body and the /s/ detection flagged it as a
sig-line carrier.
Fixes
=====
1. _explode_signature_block_lines: add a structural guard so the
parent-chain climb stops at the L0 title (depth=0) and the L2 pin
pass never demotes a depth=0 record. Title-as-root is the rubric
foundation; no signature-page text-shape can override it.
2. _LEVEL_PATTERNS: insert N.M.K (depth 3) and N.M (depth 2) regexes
BEFORE the bare \d+\. (depth 1) pattern. This makes credit/license-
agreement-style "2.1 Promise to Pay" / "2.1.1 Term Loans" land at
their correct rubric depths under topic-only Article L1 parents,
without requiring the inline-section-splitter to re-process them.
Lookahead (?![.\d]) prevents N.M from absorbing the prefix of N.M.K.
3. _explode_signature_block_lines PASS 3: do not demote a record whose
span contains "IN WITNESS WHEREOF" to L2, even if doc2dict packed
/s/ lines into its body. The IWW operating clause stays at L1
per the title-as-root signature-page rubric.
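
A minimal sketch of why pattern order matters in fix 2. The three regexes are copied from the PR diff; the list names `BARE_FIRST` and `DEEPEST_FIRST` and the `infer_level` helper are illustrative assumptions, not the parser's actual code.

```python
import re

# Regexes as they appear in the PR diff; everything else here is illustrative.
DEEPEST_FIRST = [
    (3, re.compile(r"^(\d+)\.(\d+)\.(\d+)(?![.\d])")),  # N.M.K
    (2, re.compile(r"^(\d+)\.(\d+)(?![.\d])")),         # N.M
    (1, re.compile(r"^(\d+)\.")),                       # bare N.
]
# Hypothetical "before" ordering: the bare N. pattern is tried first.
BARE_FIRST = [DEEPEST_FIRST[2], DEEPEST_FIRST[0], DEEPEST_FIRST[1]]


def infer_level(title, patterns):
    """Return the level of the first pattern that matches, else None."""
    for level, pattern in patterns:
        if pattern.match(title):
            return level
    return None


titles = ["1. Indemnity", "2.1 Promise to Pay", "2.1.1 Term Loans"]
old = [infer_level(t, BARE_FIRST) for t in titles]
new = [infer_level(t, DEEPEST_FIRST) for t in titles]
print(old)  # [1, 1, 1]
print(new)  # [1, 2, 3]
```

With the bare pattern first, every numbered heading matches on its leading `N.` and lands at level 1; trying the deepest shape first lets each heading reach its own depth.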
Stats
=====
idx=3 frozen: 102 records, levels {0: 1, 1: 18, 2: 82, 3: 1}
reconstruction: word_coverage=92.4% char_ratio=97.1%
regression: all 4 frozen idxs OK (0, 1, 2, 3).
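
As a quick consistency check on the stats above, the per-level counts should add up to the frozen record count. The snippet below only restates the numbers from this commit message.

```python
# Level histogram and record count as reported in the commit message.
levels = {0: 1, 1: 18, 2: 82, 3: 1}
total = sum(levels.values())
print(total)  # 102
```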
No actionable comments were generated in the recent review. 🎉
Configuration used: Organization UI. Review profile: ASSERTIVE. Plan: Pro. Files selected for processing: 3.
| Layer / File(s) | Summary |
|---|---|
| Depth-3 rubric pattern for N.M.K dotted sections `scripts/parse_doc2dict_with_config.py` | `_LEVEL_PATTERNS` regex set expanded to recognize N.M.K form (e.g., 2.1.1) as depth 3 before existing depth-2 mapping, using lookaheads to prevent matching deeper N.M.K.L markers. |
| Signature-block boundary safeguards `scripts/parse_doc2dict_with_config.py` | Upward chain expansion halts when reaching L0 agreement title (depth 0 parent). L2 pinning pass defensively skips depth-0 records and avoids demoting the IWW operating clause even when text spans suggest signature inclusion. |

Freeze state bookkeeping

| Layer / File(s) | Summary |
|---|---|
| Index 3 completion record `data/auto_parse/level_freeze/state.json` | Frozen array extended with index 3. History appended with two freeze entries for idx 3 dated 2026-05-17T05:46:11 and 2026-05-17T05:48:19, both recording n_records: 102. |
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~12 minutes
Possibly related PRs
- arthrod/clause-extract#36: Updates freeze state for different indices through the same frozen list and history entry mechanism.
- arthrod/clause-extract#31: Updates freeze state bookkeeping without the parsing script safeguard enhancements.
- arthrod/clause-extract#25: Modifies the same freeze state file for a different index value.
Suggested labels
Feat2
Poem
A rabbit hops through rubric rows,
Patterns dance where dotted goes,
Deep three-dot forms now come to light,
Boundaries held both left and right,
Index three is frozen tight! 🥕
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding idx=3 (a frozen baseline) and fixing the L0 signature-shape guard defect for the Savara Loan agreement. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering the parser fixes, verified output, test results, and the structural improvements to the document hierarchy. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
Code Review
This pull request enhances the document parsing logic by adding support for nested section markers (N.M and N.M.K) and implementing safeguards to ensure the L0 agreement title and IWW clauses are not misclassified as signature block elements. It also updates the state tracking for level freezing. Feedback was provided regarding the regex lookahead assertions for the new section markers, which were found to be too restrictive as they would block markers ending with a trailing period. Suggestions were made to adjust these patterns to correctly handle trailing dots while maintaining the intended prefix protection.
    # ("2.1.1 Term Loans."), so the trailing char is unconstrained.
    # MUST appear before N.M and N. patterns so the deeper structure
    # wins. See task_rules/level_rubric.md "Worked depth chain".
    (3, re.compile(r"^(\d+)\.(\d+)\.(\d+)(?![.\d])")),
The lookahead (?![.\d]) is too restrictive as it prevents matching section markers that end with a trailing period (e.g., "2.1.1. Term Loans"), which is a common style in legal agreements. Since re.match is used and the loop checks patterns in order, the lookahead only needs to prevent matching a prefix of a deeper level (like "2.1.1.1"). Changing the lookahead to (?!\.\d) allows the trailing dot while still blocking the prefix match of a deeper numeric level.
    - (3, re.compile(r"^(\d+)\.(\d+)\.(\d+)(?![.\d])")),
    + (3, re.compile(r"^(\d+)\.(\d+)\.(\d+)(?!\.\d)")),
    # before the bare N. pattern so credit/license-style numbered
    # subsections land at depth 2 under their topic-Article L1
    # parent. See task_rules/level_rubric.md depth chain.
    (2, re.compile(r"^(\d+)\.(\d+)(?![.\d])")),
Similar to the pattern above, the lookahead (?![.\d]) here prevents matching "N.M." style markers (e.g., "2.1. Promise to Pay"). Using (?!\.\d) instead will correctly allow the trailing period while preventing this pattern from incorrectly matching the prefix of an "N.M.K" marker.
    - (2, re.compile(r"^(\d+)\.(\d+)(?![.\d])")),
    + (2, re.compile(r"^(\d+)\.(\d+)(?!\.\d)")),
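
The difference between the two lookaheads can be checked in isolation. This is a small sketch using the N.M.K regex from the diff and the reviewer's suggested replacement; the names `ORIGINAL` and `SUGGESTED` and the four-part heading "2.1.1.1 Advances" are made up for illustration.

```python
import re

# Lookahead as merged in the PR vs. the reviewer's suggested replacement.
ORIGINAL = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?![.\d])")
SUGGESTED = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?!\.\d)")

headings = [
    "2.1.1 Term Loans",    # no trailing dot
    "2.1.1. Term Loans",   # trailing dot after the number
    "2.1.1.1 Advances",    # hypothetical four-part marker
]
original_hits = [bool(ORIGINAL.match(h)) for h in headings]
suggested_hits = [bool(SUGGESTED.match(h)) for h in headings]
print(original_hits)   # [True, False, False]
print(suggested_hits)  # [True, True, False]
```

Both variants reject the four-part marker; only the suggested one accepts the trailing-dot heading.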
    # ("2.1.1 Term Loans."), so the trailing char is unconstrained.
    # MUST appear before N.M and N. patterns so the deeper structure
    # wins. See task_rules/level_rubric.md "Worked depth chain".
    (3, re.compile(r"^(\d+)\.(\d+)\.(\d+)(?![.\d])")),
Suggestion: The N.M.K matcher rejects headings that include a trailing dot (for example 2.1.1. Term Loans) because the negative lookahead forbids . after the third component. Those common headings then fall through to the bare N. rule and get assigned the wrong level, flattening deep clause structure. Allow an optional terminal dot while still preventing longer numeric chains from being partially matched. [incorrect condition logic]
Severity Level: Major ⚠️
- ❌ Deep subsections mis-leveled as top-level clauses.
- ⚠️ JSONL hierarchy no longer matches legal clause nesting.
- ⚠️ Level-based downstream analyses see flattened structure.

Steps of Reproduction ✅
1. Run the parser entrypoint `main()` in `scripts/parse_doc2dict_with_config.py:4–26`,
e.g. via `uv run scripts/parse_doc2dict_with_config.py --limit 1 --output-dir
data/auto_parse`, so that it processes an EX-10 agreement whose doc2dict section tree
contains a heading text like `"2.1.1. Term Loans"` (three-part number followed by a
terminating period), a format explicitly discussed in
`src/clause_extract/agreement_config.py:10–13` where terminator periods after numbers such
as `"Section 2.1."` and `"Article 1.1."` are handled.
2. Inside `parse_one()` (`scripts/parse_doc2dict_with_config.py:1–37`), the HTML is
converted to a `document_tree` via `html2dict(..., mapping_dict=_AGREEMENT_CONFIG)`, and
then `walk_sections(document_tree, depth=0)` is called
(`scripts/parse_doc2dict_with_config.py:36–37`), which iterates doc2dict section nodes,
including the `"2.1.1. Term Loans"` node.
3. For that section node, `walk_sections()` calls `_remap_depth(title, cls, depth,
parent_level=parent_level)` at `scripts/parse_doc2dict_with_config.py:76`, which in turn
calls `_infer_level_from_title(title or "")`
(`scripts/parse_doc2dict_with_config.py:21–23,165–175`). `_infer_level_from_title` loops
over `_LEVEL_PATTERNS` (`scripts/parse_doc2dict_with_config.py:25–63`): the three-part
pattern `(3, re.compile(r"^(\d+)\.(\d+)\.(\d+)(?![.\d])"))` at line 147 fails to match
`"2.1.1. Term Loans"` because the character immediately after the third numeric group is
`"."`, which is explicitly forbidden by the negative lookahead `(?![.\d])`.
4. Because the `N.M.K` pattern fails (and the `N.M` pattern at line 155 also fails for the
same reason), `_infer_level_from_title` continues to the bare `N.` pattern `(1,
re.compile(r"^(\d+)\."))` at line 156, which successfully matches the `"2."` prefix of
`"2.1.1. Term Loans"`. It therefore returns level `1` instead of the intended level `3`.
`_remap_depth` uses this misinferred level for the section row
(`scripts/parse_doc2dict_with_config.py:21–33, 96–103, 32–33`), so in the JSONL written by
`main()` (`parse_doc2dict_with_config_nodes.jsonl`), the `"2.1.1. Term Loans"` heading and
its descendants are flattened into top-level body depth, breaking the intended
parent-child hierarchy for three-part numbered clauses.
    # before the bare N. pattern so credit/license-style numbered
    # subsections land at depth 2 under their topic-Article L1
    # parent. See task_rules/level_rubric.md depth chain.
    (2, re.compile(r"^(\d+)\.(\d+)(?![.\d])")),
Suggestion: The N.M matcher has the same lookahead issue and fails on headings like 2.1. Definitions, so they also fall through to the bare N. pattern and become level 1 instead of level 2. This breaks parent-child hierarchy for agreements that punctuate subsection numbers with a trailing dot. [incorrect condition logic]
Severity Level: Major ⚠️
- ❌ Two-part subsections downgraded from level 2 to level 1.
- ⚠️ Children appear deeper than their mis-leveled parent.
- ⚠️ Consumers of level rubric misinterpret agreement structure.

Steps of Reproduction ✅
1. Run `main()` in `scripts/parse_doc2dict_with_config.py:4–26` over an EX-10 agreement
whose doc2dict section tree contains a heading like `"2.1. Definitions"` (i.e., the
section number `"2.1."` followed by a space and title text). The existence of terminator
periods immediately after section numbers such as
`"Section 2.1."` is documented in `src/clause_extract/agreement_config.py:10–13`, where a
`\.?` slot was added specifically to handle this punctuation.
2. In `parse_one()` (`scripts/parse_doc2dict_with_config.py:1–37`), `html2dict` builds
`document_tree`, and `walk_sections(document_tree, depth=0)` is invoked
(`scripts/parse_doc2dict_with_config.py:36–37`). As `walk_sections` iterates nodes
(`scripts/parse_doc2dict_with_config.py:38–75, 96–115`), it encounters the `"2.1.
Definitions"` section and calls `_remap_depth(title, cls, depth,
parent_level=parent_level)` at line 76.
3. `_remap_depth` calls `_infer_level_from_title(title or "")`
(`scripts/parse_doc2dict_with_config.py:21–23,165–175`), which tests `_LEVEL_PATTERNS` in
order (`scripts/parse_doc2dict_with_config.py:25–63`). For the title `"2.1. Definitions"`,
the two-part `N.M` matcher `(2, re.compile(r"^(\d+)\.(\d+)(?![.\d])"))` at line 155 fails
because immediately after the second digit `"1"` the next character is `"."`, which the
negative lookahead `(?![.\d])` explicitly forbids; thus the `N.M` pattern does not match
headings that punctuate the number as `"2.1."`.
4. After the `N.M` pattern fails, `_infer_level_from_title` proceeds to the bare `N.` rule
`(1, re.compile(r"^(\d+)\."))` at line 156, which matches only the `"2."` prefix of `"2.1.
Definitions"` and causes `_infer_level_from_title` to return level `1` instead of the
intended level `2`. `_remap_depth` then uses this incorrect level
(`scripts/parse_doc2dict_with_config.py:21–33, 96–103`), so the `"2.1. Definitions"`
header is emitted as a top-level body clause in the JSONL, while its lettered or numbered
children are assigned deeper levels. This breaks the intended parent-child hierarchy for
common two-part subsection headings that include a trailing period after the number.
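
One caveat worth checking before adopting a relaxed lookahead: once the lookahead no longer forbids a following digit, `\d+` can backtrack inside a multi-digit component. The sketch below compares the `(?!\.\d)` form suggested in this thread with a hypothetical `(?!\.?\d)` variant that is not part of the PR; `build_patterns`, `infer_level`, and the four-part sample headings are assumptions for illustration.

```python
import re


def build_patterns(lookahead):
    """Build an N.M.K / N.M / N. list sharing one lookahead (illustrative helper)."""
    return [
        (3, re.compile(r"^(\d+)\.(\d+)\.(\d+)" + lookahead)),
        (2, re.compile(r"^(\d+)\.(\d+)" + lookahead)),
        (1, re.compile(r"^(\d+)\.")),
    ]


def infer_level(title, patterns):
    for level, pattern in patterns:
        if pattern.match(title):
            return level
    return None


suggested = build_patterns(r"(?!\.\d)")   # form proposed in the review comments
hardened = build_patterns(r"(?!\.?\d)")   # hypothetical variant, not from the PR

titles = [
    "2.1. Definitions",
    "2.1.1. Term Loans",
    "2.1.1.1 Advances",   # made-up four-part marker
    "2.1.10.1 Advances",  # made-up four-part marker with a two-digit component
]
suggested_levels = [infer_level(t, suggested) for t in titles]
hardened_levels = [infer_level(t, hardened) for t in titles]
print(suggested_levels)  # [2, 3, 1, 3]
print(hardened_levels)   # [2, 3, 1, 1]
```

Under `(?!\.\d)`, "2.1.10.1" matches as "2.1.1" because the third group backtracks from "10" to "1", so the two four-part markers end up at different levels. The `(?!\.?\d)` variant still accepts a trailing dot but treats both four-part markers the same way. Whether level 1 is the right landing spot for four-part markers is a separate question for the rubric.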
Triage agent: PR #76 comment review (read-only pass, no code changes)
4 inline comments reviewed (2 unique issues, raised by both Gemini and CodeAnt):
NEED-DISCUSSION (1 logical issue, 2 affected patterns): Trailing-dot lookahead fix in
Triage only; no code changes made this round.
User description
Summary
Fourth stacked PR. Adds idx=3 (LOAN AND SECURITY AGREEMENT, Savara Inc., credit/loan agreement) as the fourth verified frozen baseline on top of idx=2 (PR #75).
This PR fixes a critical foundational defect: idx=3 was emitting zero L0 records because the title "LOAN AND SECURITY AGREEMENT" — short and all-caps — matched the signature-page label shape and was being demoted to L2 by the sig-explosion logic. Three surgical fixes restore title-as-root for this entire class of agreements.
Parser changes (3 surgical, shape-driven, no phrase blocklists)
1. `_explode_signature_block_lines` L0 guard: parent-chain climb in PASS 2 stops at any `depth == 0` record; PASS 3 depth-pinning loop defensively skips depth-0 records. Title-as-root is the structural foundation; no sig-shape text can override it.
2. `_LEVEL_PATTERNS` N.M / N.M.K patterns inserted before bare `\d+\.`:
   Lookaheads prevent N.M from absorbing the prefix of N.M.K. "1. Indemnity" (single number) still → L1 per the prior pattern.
3. IWW L1 pin defense: PASS 3 of `_explode_signature_block_lines` skips any record whose span contains an IWW clause (via `_is_iww_clause`), preserving the L1 depth set by PASS 2 even when doc2dict packs sig-page text into the IWW body.
Verified output for idx=3
`{L0:1, L1:18, L2:82, L3:1}`
Top-level structure
Known minor quirks (non-blocking; inspector concurs)
Test plan
- `uv run scripts/parse_doc2dict_with_config.py --limit 4 --no-truncate --output-dir data/auto_parse` exits 0 with `ok 4`
- `uv run scripts/level_loop/freeze.py 3 --force` reports word_coverage ≥ 90%
- `uv run scripts/level_loop/regress.py` reports all 4 frozen idxs OK
Source
http://www.sec.gov/Archives/edgar/data/1160308/000156459017009915/svra-ex103_269.htm
Why this matters for the corpus
The three-layer guard (L0 depth-0 protection + N.M precedence + IWW L1 pin) generalizes. Any subsequent agreement with:
…will now be correctly handled. This is foundational for the remaining 1062 corpus rows.
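
The depth-0 and IWW protections described above can be sketched roughly as follows. This is an assumption-heavy illustration, not the parser's code: the record shape (`depth` and `text` keys), the `pin_signature_records` function, the `flagged` index set, and the sample record texts are all hypothetical; only the two skip conditions come from the PR description.

```python
IWW_MARKER = "IN WITNESS WHEREOF"


def pin_signature_records(records, flagged):
    """Demote flagged records to depth 2, skipping the L0 title and IWW clause.

    `records` is a list of dicts with hypothetical "depth" and "text" keys;
    `flagged` is the set of indices a signature-shape detector marked.
    """
    pinned = []
    for i, rec in enumerate(records):
        rec = dict(rec)  # copy so the input list is left untouched
        protected = rec["depth"] == 0 or IWW_MARKER in rec["text"].upper()
        if i in flagged and not protected:
            rec["depth"] = 2
        pinned.append(rec)
    return pinned


# Made-up records: a title, an IWW clause with packed sig-page text, a sig label.
records = [
    {"depth": 0, "text": "LOAN AND SECURITY AGREEMENT"},
    {"depth": 1, "text": "IN WITNESS WHEREOF, the parties have signed. /s/ A. Signer"},
    {"depth": 1, "text": "BORROWER:"},
]
pinned = pin_signature_records(records, flagged={0, 1, 2})
print([r["depth"] for r in pinned])  # [0, 1, 2]
```

Even with all three records flagged, only the signature label is demoted; the title stays at depth 0 and the IWW clause keeps its L1 depth.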
🤖 Generated with Claude Code
CodeAnt-AI Description
Keep the agreement title at the root and place numbered sections at the right depth
What Changed
- `2.1` and `2.1.1` now appear under the correct parent section instead of being flattened to the wrong level
Impact
✅ Correct agreement structure
✅ Fewer misplaced section headings
✅ More reliable frozen baselines