idx=3: freeze (102 records) — Savara Loan & Security Agreement, L0 sig-shape guard #76

Open: arthrod wants to merge 1 commit into redo/idx-2 from redo/idx-3

@arthrod (Owner) commented May 17, 2026

User description

Summary

Fourth stacked PR. Adds idx=3 (LOAN AND SECURITY AGREEMENT, Savara Inc., credit/loan agreement) as the fourth verified frozen baseline on top of idx=2 (PR #75).

This PR fixes a critical foundational defect: idx=3 was emitting zero L0 records because the title "LOAN AND SECURITY AGREEMENT" — short and all-caps — matched the signature-page label shape and was being demoted to L2 by the sig-explosion logic. Three surgical fixes restore title-as-root for this entire class of agreements.

Parser changes (3 surgical, shape-driven, no phrase blocklists)

  1. _explode_signature_block_lines L0 guard — parent-chain climb in PASS 2 stops at any depth == 0 record; PASS 3 depth-pinning loop defensively skips depth-0 records. Title-as-root is structural foundation; no sig-shape text can override it.

  2. _LEVEL_PATTERNS N.M / N.M.K patterns inserted before bare \d+\.:

    (3, r'^(\d+)\.(\d+)\.(\d+)(?![.\d])')   # 2.1.1 → L3
    (2, r'^(\d+)\.(\d+)(?![.\d])')          # 2.1 → L2

    Lookaheads prevent N.M from absorbing the prefix of N.M.K. "1. Indemnity" (single number) still → L1 per the prior pattern.

  3. IWW L1 pin defense — PASS 3 of _explode_signature_block_lines skips any record whose span contains an IWW clause (via _is_iww_clause), preserving the L1 depth set by PASS 2 even when doc2dict packs sig-page text into the IWW body.
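The ordering in fix 2 can be illustrated with a minimal sketch. It assumes levels are inferred by trying the patterns in list order and returning the first match. `LEVEL_PATTERNS` and `infer_level` are illustrative stand-ins, not the actual `_LEVEL_PATTERNS` table or its caller in `scripts/parse_doc2dict_with_config.py`, which holds other patterns as well.

```python
import re

# Illustrative subset only: the real _LEVEL_PATTERNS has more entries.
LEVEL_PATTERNS = [
    (3, re.compile(r"^(\d+)\.(\d+)\.(\d+)(?![.\d])")),  # 2.1.1 -> L3
    (2, re.compile(r"^(\d+)\.(\d+)(?![.\d])")),         # 2.1   -> L2
    (1, re.compile(r"^(\d+)\.")),                       # 1.    -> L1
]


def infer_level(title):
    """Return the level of the first pattern that matches, else None."""
    for level, pattern in LEVEL_PATTERNS:
        if pattern.match(title):
            return level
    return None


print(infer_level("2.1.1 Term Loans"))           # 3
print(infer_level("2.1 Promise to Pay"))         # 2
print(infer_level("1. Indemnity"))               # 1
print(infer_level("LOAN AND TERMS OF PAYMENT"))  # None
```

If the bare `N.` pattern were listed first, "2.1 Promise to Pay" would match on its "2." prefix and return 1, which is the flattening this PR describes.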

Verified output for idx=3

  • 102 records, distribution {L0:1, L1:18, L2:82, L3:1}
  • Reconstruction: word_coverage 92.4%, char_ratio 97.1% (above 90% gate)
  • Max depth: 3
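The reported distribution can be cross-checked with a short sketch. It assumes each line of the frozen JSONL is one JSON object with an integer `level` field. That field name and the helper `level_distribution` are assumptions for illustration and are not confirmed by this PR.

```python
import json
from collections import Counter


def level_distribution(jsonl_lines, level_key="level"):
    """Count records per level in an iterable of JSONL strings."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        counts[json.loads(line)[level_key]] += 1
    return dict(sorted(counts.items()))


# Synthetic sample, not taken from idx_3.jsonl.
sample = ['{"level": 0}', '{"level": 1}', '{"level": 1}', '{"level": 2}']
print(level_distribution(sample))  # {0: 1, 1: 2, 2: 1}

# The distribution reported above sums to the record count.
reported = {0: 1, 1: 18, 2: 82, 3: 1}
print(sum(reported.values()))  # 102
```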

Top-level structure

o=0   L0 : LOAN AND SECURITY AGREEMENT                  ← was missing entirely before fix
o=1   L1 : THIS LOAN AND SECURITY AGREEMENT (as amended... (preamble)
o=2   L1 : ACCOUNTING AND OTHER TERMS                    (topic-only Article 1)
o=3   L1 : LOAN AND TERMS OF PAYMENT                     (topic-only Article 2)
o=4   L2 : 2.1 Promise to Pay...                          (was L1 before fix)
o=5   L3 : 2.1.1 Term Loans...                            (was L1 before fix)
o=7   L2 : 2.3 Payment of Interest...
o=8   L2 : 2.4 Fees...
o=11  L1 : CONDITIONS OF LOANS
o=12  L2 : 3.1 Conditions Precedent...
...
o=100 L1 : IN WITNESS WHEREOF, the parties hereto...     (was being demoted to L2)
o=101 L1 : EXHIBIT A                                      (header only — doc2dict body lost upstream)

Known minor quirks (non-blocking; inspector concurs)

  1. Order 44 is an orphaned continuation fragment from section 6.10 ("audits shall be at Co-Borrowers' expense…") promoted to L1 mid-clause. Pre-existing doc2dict body-split artifact, not introduced by these fixes. 92.4% reconstruction confirms no data loss.
  2. EXHIBIT A at order=101 carries only its header with no body or subdoc-penalty children. Collateral description appears absent from doc2dict's output at that boundary — upstream HTML extraction issue, not a parser regression.

Test plan

  • uv run scripts/parse_doc2dict_with_config.py --limit 4 --no-truncate --output-dir data/auto_parse exits 0 with ok 4
  • uv run scripts/level_loop/freeze.py 3 --force reports word_coverage ≥ 90%
  • uv run scripts/level_loop/regress.py reports all 4 frozen idxs OK
  • Manual visual verification of all 3 fixes by independent inspector agent (PASS verdict)
  • All 4 idxs byte-identical regress after the parser changes

Source

http://www.sec.gov/Archives/edgar/data/1160308/000156459017009915/svra-ex103_269.htm

Why this matters for the corpus

The three-layer guard (L0 depth-0 protection + N.M precedence + IWW L1 pin) generalizes. Any subsequent agreement with:

  • An inline-title structure (title baked into the preamble's "THIS X AGREEMENT..." pattern)
  • A short ALL-CAPS title that could match sig-label shape
  • Sig-page text packed into the IWW record's body by doc2dict

…will now be correctly handled. This is foundational for the remaining 1062 corpus rows.

🤖 Generated with Claude Code


CodeAnt-AI Description

Keep the agreement title at the root and place numbered sections at the right depth

What Changed

  • The agreement title now stays as the top-level record instead of being treated like a signature line
  • Numbered sections such as 2.1 and 2.1.1 now appear under the correct parent section instead of being flattened to the wrong level
  • The main signature clause is no longer pushed down to a lower level when its text is merged with signature-page content
  • idx=3 is now frozen successfully with 102 records

Impact

✅ Correct agreement structure
✅ Fewer misplaced section headings
✅ More reliable frozen baselines

…pth patterns + IWW L1 pin

Defect summary
==============
idx=3 (Savara Loan and Security Agreement) had no L0 title record at all.
The title "LOAN AND SECURITY AGREEMENT" was incorrectly demoted to L2
by the signature-page exploder because its short uppercase shape
matched _SIG_BLOCK_LABEL_RE. Beyond that, numbered subsections (N.M
like "2.1 Promise to Pay") were all landing at L1 because the bare
\d+\. pattern matched first; they should be L2 children of the topic-
only Article L1 headers ("LOAN AND TERMS OF PAYMENT", etc.). The IWW
operating clause also got demoted to L2 because doc2dict packed the
sig-page text into its body and the /s/ detection flagged it as a
sig-line carrier.

Fixes
=====
1. _explode_signature_block_lines: add a structural guard so the
   parent-chain climb stops at the L0 title (depth=0) and the L2 pin
   pass never demotes a depth=0 record. Title-as-root is the rubric
   foundation; no signature-page text-shape can override it.

2. _LEVEL_PATTERNS: insert N.M.K (depth 3) and N.M (depth 2) regexes
   BEFORE the bare \d+\. (depth 1) pattern. This makes credit/license-
   agreement-style "2.1 Promise to Pay" / "2.1.1 Term Loans" land at
   their correct rubric depths under topic-only Article L1 parents,
   without requiring the inline-section-splitter to re-process them.
   Lookahead (?![.\d]) prevents N.M from absorbing the prefix of N.M.K.

3. _explode_signature_block_lines PASS 3: do not demote a record whose
   span contains "IN WITNESS WHEREOF" to L2, even if doc2dict packed
   /s/ lines into its body. The IWW operating clause stays at L1
   per the title-as-root signature-page rubric.

Stats
=====
idx=3 frozen: 102 records, levels {0: 1, 1: 18, 2: 82, 3: 1}
  reconstruction: word_coverage=92.4% char_ratio=97.1%
regression: all 4 frozen idxs OK (0, 1, 2, 3).

coderabbitai Bot commented May 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2295485f-2807-48b9-8952-e0c2e121ab9c

📥 Commits

Reviewing files that changed from the base of the PR and between 9f9c160 and 7c6de32.

📒 Files selected for processing (3)
  • data/auto_parse/level_freeze/frozen/idx_3.jsonl
  • data/auto_parse/level_freeze/state.json
  • scripts/parse_doc2dict_with_config.py
📜 Recent review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (Custom checks)

**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run <cli> --help, assert exit code 0. Fail if smoke test fails.
Run uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run uv run ruff check . --diff for Python linting. Fail if exit code is non-zero and list each violation.
Run uv run ruff format --check --diff . for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run uv run ruff check --select I,F401 . to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: uv run pytest --tb=line -q on origin/main to capture baseline pass/fail counts, and uv run pytest --tb=short -q on PR branch. Fail immediately if exit code is non-zero.
Run uv run typy check for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare type: ignore comments (without error codes) in Python files and cast() calls without explanatory comments. Warn for each. Fail if bare type: ignore count > 3.

Files:

  • scripts/parse_doc2dict_with_config.py
**/*.{py,ts,tsx}

📄 CodeRabbit inference engine (Custom checks)

For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.

Files:

  • scripts/parse_doc2dict_with_config.py
🔍 Remote MCP

Based on my search attempts, I was unable to retrieve the specific technical details from the repository itself due to DeepWiki not having the repository indexed, and Context7 quota limitations. However, I can provide you with a summary of the relevant information from your PR context and what would be important to verify during review:

Summary of PR #76 Review Context

Key Changes Overview

1. State File Update (data/auto_parse/level_freeze/state.json)

  • Extended frozen list to include index 3 (Savara Inc. LOAN AND SECURITY AGREEMENT)
  • Appended two new freeze records at idx: 3 with 102 records each
  • Timestamps: 2026-05-17T05:46:11 and 2026-05-17T05:48:19

2. Parser Logic Enhancement (scripts/parse_doc2dict_with_config.py)

The PR implements three targeted, shape-driven parser changes:

Change 1: L0 Title Guard in Signature Block Exploder

  • PASS 2: Parent-chain climb now stops at depth==0 (prevents climbing above document title)
  • PASS 3: Depth-pinning skips depth-0 records (prevents signature-shape text from demoting title)
  • This prevents the short, all-caps title "LOAN AND SECURITY AGREEMENT" from being misclassified as signature-page label

Change 2: Rubric Depth Pattern Ordering

  • Inserted patterns for N.M.K (depth 3) and N.M (depth 2) before the bare \d+\. pattern
  • Added lookaheads to prevent N.M from accidentally consuming N.M.K patterns
  • Ensures subsections like "2.1" and "2.1.1" receive correct depths under topic-only Article L1 headers

Change 3: IWW L1 Pin

  • PASS 3 now skips records whose span contains an IWW clause (via _is_iww_clause)
  • Preserves L1 depth when doc2dict packs signature-page text into the IWW body
  • Prevents demotion of the IN WITNESS WHEREOF (IWW) operating clause
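The two guards described in Changes 1 and 3 can be sketched as follows. This is a simplified model under stated assumptions, not the code in `_explode_signature_block_lines`: records are plain dicts with hypothetical `text`, `depth`, and `parent` keys, and `is_iww_clause` is a stand-in for `_is_iww_clause`, whose real logic is not shown in this PR.

```python
import re

# Hypothetical stand-in for _is_iww_clause; the real helper may differ.
IWW_RE = re.compile(r"\bIN WITNESS WHEREOF\b", re.IGNORECASE)


def is_iww_clause(text):
    return bool(IWW_RE.search(text))


def climb_to_block_root(records, start):
    """PASS 2 analogue: walk up parent links but never step onto depth 0."""
    idx = start
    while records[idx]["parent"] is not None:
        parent = records[idx]["parent"]
        if records[parent]["depth"] == 0:  # L0 guard: the title is never claimed
            break
        idx = parent
    return idx


def pin_signature_depth(records, candidates, pinned_depth=2):
    """PASS 3 analogue: pin candidates to L2, skipping the title and IWW."""
    for idx in candidates:
        rec = records[idx]
        if rec["depth"] == 0 or is_iww_clause(rec["text"]):
            continue
        rec["depth"] = pinned_depth


# Invented sample records for illustration only.
records = [
    {"text": "LOAN AND SECURITY AGREEMENT", "depth": 0, "parent": None},
    {"text": "IN WITNESS WHEREOF, the parties hereto ... /s/ [name]", "depth": 1, "parent": 0},
    {"text": "BORROWER:", "depth": 1, "parent": 0},
    {"text": "By: /s/ [name]", "depth": 3, "parent": 2},
]
print(climb_to_block_root(records, 3))  # 2
pin_signature_depth(records, [0, 1, 2, 3])
print([r["depth"] for r in records])    # [0, 1, 2, 2]
```

In this model both guards are structural: they depend on depth and on the IWW check, not on whether the title text happens to look like a signature label.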

Verification Results

idx=3 Output: 102 records

  • Level distribution: L0:1, L1:18, L2:82, L3:1
  • Reconstruction metrics: 92.4% word_coverage, 97.1% char_ratio, max depth 3
  • Top-level structure restored (L0 title, L1 preamble and articles, corrected L2/L3 subsections)

Known Quirks (Not Regressions)

  • One orphaned continuation fragment
  • EXHIBIT A header-only record
  • Both attributed to upstream doc2dict/HTML extraction, not parser regressions

Test Plan Completion

  • ✓ Parsing completed
  • ✓ Freeze procedure passed
  • ✓ Regression tests passed (byte-identical for all frozen indices)
  • ✓ Inspector verification completed

Related PRs Pattern

The related PRs (#36, #25, #19, #16, #21, #39, #31, #33) all follow a similar pattern of freezing different indices with state bookkeeping updates, though PR #76 is unique in including substantive parser logic changes rather than just state updates.

Note: The user requirements mentioned "Stitch" for design recommendations, but this appears to be not applicable to this parser/data-handling PR focused on legal document structure extraction rather than UI/design work.

🔇 Additional comments (5)
data/auto_parse/level_freeze/state.json (1)

6-7: LGTM!

Also applies to: 116-127

scripts/parse_doc2dict_with_config.py (4)

3630-3650: Verify IWW clause preservation guard prevents demotion to L2.

The guards at lines 3635-3646 correctly prevent depth pinning from demoting the L0 agreement title and the IWW operating clause. Evidence from frozen output confirms IWW clauses remain at level 1 even when signature marks are packed into their body, validating that the _is_iww_clause guard successfully prevents the unintended L1 → L2 demotion.


3537-3551: Verify L0 title guard prevents demotion in signature block expansion.

The L0 guard at lines 3548-3551 correctly stops the upward climb when reaching the agreement title (depth==0), preventing short all-caps titles like "LOAN AND SECURITY AGREEMENT" from being misclassified as signature-block labels.

Run the following script to verify the guard is effective:

#!/bin/bash
# Description: Verify L0 title preservation in signature block handling.

# Search for test cases covering short all-caps agreement titles
rg -n -C5 --type=py -g '*test*.py' 'LOAN.*AGREEMENT|depth.*0.*title'

# Search for frozen baseline with idx=3 (Savara case mentioned in PR)
fd -e json -e jsonl . data/auto_parse/level_freeze/ --exec rg -l 'idx.*3'

# Verify the idx=3 frozen output has L0 title at depth 0, not depth 2
rg -n '"level":\s*0' data/auto_parse/level_freeze/frozen/idx_3.jsonl

Expected: Confirm that idx=3 (Savara LOAN AND SECURITY AGREEMENT) has:

  • L0 record at level 0 with title "LOAN AND SECURITY AGREEMENT"
  • NO signature-line classification for the L0 title
  • 102 total records with level distribution {L0:1, L1:18, L2:82, L3:1}

139-155: Pattern implementation is correct and exempted from unit test requirement.

The N.M.K (depth 3) and N.M (depth 2) regex patterns are properly formulated with lookaheads to prevent overlap, and pattern ordering ensures deeper structures match first. As a utility script with no new exported functions, this file is exempt from the unit test file requirement per coding guidelines. The PR's manual testing validation is adequate.


139-155: Implementation looks correct; three depth-pattern and guard fixes are well-reasoned.

The three parser changes are properly implemented:

  1. N.M.K and N.M depth patterns (lines 139-155): Correctly uses negative lookaheads to prevent pattern overlap. The patterns match documented examples ("2.1.1 Term Loans" at depth 3, "2.1 Promise to Pay" at depth 2) confirmed in frozen test baseline idx_3.jsonl.

  2. L0 upward climb guard (lines 3548-3551): Stops parent-chain climb at depth==0, preventing short all-caps agreement titles from being misclassified as signature-block labels.

  3. L0 and IWW depth pinning guards (lines 3635-3646): Skips depth-0 records and IWW clause records during depth pinning, preserving L0 title at depth 0 and IWW clause at L1.

All three are shape-driven fixes (no phrase blocklists) handling the defect where "LOAN AND SECURITY AGREEMENT" was misclassified as signature-page label.

			> Likely an incorrect or invalid review comment.

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced document parsing logic to more accurately recognize section hierarchy levels.
    • Added safeguards to prevent incorrect handling of document titles and key structural elements during processing.
  • Chores

    • Updated state configuration data with additional processing records.

Walkthrough

The PR adds depth-3 regex pattern matching for N.M.K dotted section rubric headings, strengthens signature-block explosion logic to preserve L0 agreement title and L1 IWW clause boundaries, and records completion of dataset index 3 in the freeze state bookkeeping.

Changes

Document structure parsing enhancements

| Layer / File(s) | Summary |
| --- | --- |
| Depth-3 rubric pattern for N.M.K dotted sections (scripts/parse_doc2dict_with_config.py) | _LEVEL_PATTERNS regex set expanded to recognize N.M.K form (e.g., 2.1.1) as depth 3 before existing depth-2 mapping, using lookaheads to prevent matching deeper N.M.K.L markers. |
| Signature-block boundary safeguards (scripts/parse_doc2dict_with_config.py) | Upward chain expansion halts when reaching L0 agreement title (depth 0 parent). L2 pinning pass defensively skips depth-0 records and avoids demoting the IWW operating clause even when text spans suggest signature inclusion. |

Freeze state bookkeeping

| Layer / File(s) | Summary |
| --- | --- |
| Index 3 completion record (data/auto_parse/level_freeze/state.json) | Frozen array extended with index 3. History appended with two freeze entries for idx 3 dated 2026-05-17T05:46:11 and 2026-05-17T05:48:19, both recording n_records: 102. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Suggested labels

Feat2

Poem

A rabbit hops through rubric rows,
Patterns dance where dotted goes,
Deep three-dot forms now come to light,
Boundaries held both left and right,
Index three is frozen tight! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding idx=3 (a frozen baseline) and fixing the L0 signature-shape guard defect for the Savara Loan agreement. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering the parser fixes, verified output, test results, and the structural improvements to the document hierarchy. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@codeant-ai codeant-ai Bot added the size:L This PR changes 100-499 lines, ignoring generated files label May 17, 2026
@coderabbitai coderabbitai Bot added the Feat2 label May 17, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enhances the document parsing logic by adding support for nested section markers (N.M and N.M.K) and implementing safeguards to ensure the L0 agreement title and IWW clauses are not misclassified as signature block elements. It also updates the state tracking for level freezing. Feedback was provided regarding the regex lookahead assertions for the new section markers, which were found to be too restrictive as they would block markers ending with a trailing period. Suggestions were made to adjust these patterns to correctly handle trailing dots while maintaining the intended prefix protection.

# ("2.1.1 Term Loans."), so the trailing char is unconstrained.
# MUST appear before N.M and N. patterns so the deeper structure
# wins. See task_rules/level_rubric.md "Worked depth chain".
(3, re.compile(r"^(\d+)\.(\d+)\.(\d+)(?![.\d])")),

high

The lookahead (?![.\d]) is too restrictive as it prevents matching section markers that end with a trailing period (e.g., "2.1.1. Term Loans"), which is a common style in legal agreements. Since re.match is used and the loop checks patterns in order, the lookahead only needs to prevent matching a prefix of a deeper level (like "2.1.1.1"). Changing the lookahead to (?!\.\d) allows the trailing dot while still blocking the prefix match of a deeper numeric level.

Suggested change
- (3, re.compile(r"^(\d+)\.(\d+)\.(\d+)(?![.\d])")),
+ (3, re.compile(r"^(\d+)\.(\d+)\.(\d+)(?!\.\d)")),

# before the bare N. pattern so credit/license-style numbered
# subsections land at depth 2 under their topic-Article L1
# parent. See task_rules/level_rubric.md depth chain.
(2, re.compile(r"^(\d+)\.(\d+)(?![.\d])")),

high

Similar to the pattern above, the lookahead (?![.\d]) here prevents matching "N.M." style markers (e.g., "2.1. Promise to Pay"). Using (?!\.\d) instead will correctly allow the trailing period while preventing this pattern from incorrectly matching the prefix of an "N.M.K" marker.

Suggested change
- (2, re.compile(r"^(\d+)\.(\d+)(?![.\d])")),
+ (2, re.compile(r"^(\d+)\.(\d+)(?!\.\d)")),
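The effect of the two lookaheads can be compared directly with a small sketch. `ORIGINAL` and `SUGGESTED` are the N.M.K patterns quoted in this review. `STRICT` is a hypothetical third variant that nobody in this thread proposed, included only to show one way of keeping the digit check. The two "Advances" headings are invented test inputs, not taken from the Savara agreement.

```python
import re

ORIGINAL = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?![.\d])")
SUGGESTED = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?!\.\d)")
# Hypothetical variant: reject a following digit or a dot-then-digit.
STRICT = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?!\d|\.\d)")

HEADINGS = [
    "2.1.1 Term Loans",   # no trailing dot
    "2.1.1. Term Loans",  # trailing dot
    "2.1.1.1 Advances",   # deeper marker, single-digit components
    "2.1.15.3 Advances",  # deeper marker, multi-digit component
]


def matched_prefix(pattern, heading):
    """Return the text the pattern consumed, or None when it does not match."""
    m = pattern.match(heading)
    return m.group(0) if m else None


for heading in HEADINGS:
    results = [matched_prefix(p, heading) for p in (ORIGINAL, SUGGESTED, STRICT)]
    print(heading, results)
```

Under these patterns, `SUGGESTED` accepts the trailing-dot heading that `ORIGINAL` rejects. It also matches the `2.1.1` prefix of `2.1.15.3`, because without `\d` in the lookahead the last group can backtrack to a single digit. `STRICT` accepts the trailing dot and rejects both deeper markers. Whether four-part markers occur in the corpus is not established here, so this is an edge case to check rather than a confirmed defect.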

# ("2.1.1 Term Loans."), so the trailing char is unconstrained.
# MUST appear before N.M and N. patterns so the deeper structure
# wins. See task_rules/level_rubric.md "Worked depth chain".
(3, re.compile(r"^(\d+)\.(\d+)\.(\d+)(?![.\d])")),

Suggestion: The N.M.K matcher rejects headings that include a trailing dot (for example 2.1.1. Term Loans) because the negative lookahead forbids . after the third component. Those common headings then fall through to the bare N. rule and get assigned the wrong level, flattening deep clause structure. Allow an optional terminal dot while still preventing longer numeric chains from being partially matched. [incorrect condition logic]

Severity Level: Major ⚠️
- ❌ Deep subsections mis-leveled as top-level clauses.
- ⚠️ JSONL hierarchy no longer matches legal clause nesting.
- ⚠️ Level-based downstream analyses see flattened structure.
Steps of Reproduction ✅
1. Run the parser entrypoint `main()` in `scripts/parse_doc2dict_with_config.py:4–26`,
e.g. via `uv run scripts/parse_doc2dict_with_config.py --limit 1 --output-dir
data/auto_parse`, so that it processes an EX-10 agreement whose doc2dict section tree
contains a heading text like `"2.1.1. Term Loans"` (three-part number followed by a
terminating period), a format explicitly discussed in
`src/clause_extract/agreement_config.py:10–13` where terminator periods after numbers such
as `"Section 2.1."` and `"Article 1.1."` are handled.

2. Inside `parse_one()` (`scripts/parse_doc2dict_with_config.py:1–37`), the HTML is
converted to a `document_tree` via `html2dict(..., mapping_dict=_AGREEMENT_CONFIG)`, and
then `walk_sections(document_tree, depth=0)` is called
(`scripts/parse_doc2dict_with_config.py:36–37`), which iterates doc2dict section nodes,
including the `"2.1.1. Term Loans"` node.

3. For that section node, `walk_sections()` calls `_remap_depth(title, cls, depth,
parent_level=parent_level)` at `scripts/parse_doc2dict_with_config.py:76`, which in turn
calls `_infer_level_from_title(title or "")`
(`scripts/parse_doc2dict_with_config.py:21–23,165–175`). `_infer_level_from_title` loops
over `_LEVEL_PATTERNS` (`scripts/parse_doc2dict_with_config.py:25–63`): the three-part
pattern `(3, re.compile(r"^(\d+)\.(\d+)\.(\d+)(?![.\d])"))` at line 147 fails to match
`"2.1.1. Term Loans"` because the character immediately after the third numeric group is
`"."`, which is explicitly forbidden by the negative lookahead `(?![.\d])`.

4. Because the `N.M.K` pattern fails (and the `N.M` pattern at line 155 also fails for the
same reason), `_infer_level_from_title` continues to the bare `N.` pattern `(1,
re.compile(r"^(\d+)\."))` at line 156, which successfully matches the `"2."` prefix of
`"2.1.1. Term Loans"`. It therefore returns level `1` instead of the intended level `3`.
`_remap_depth` uses this misinferred level for the section row
(`scripts/parse_doc2dict_with_config.py:21–33, 96–103, 32–33`), so in the JSONL written by
`main()` (`parse_doc2dict_with_config_nodes.jsonl`), the `"2.1.1. Term Loans"` heading and
its descendants are flattened into top-level body depth, breaking the intended
parent-child hierarchy for three-part numbered clauses.

# before the bare N. pattern so credit/license-style numbered
# subsections land at depth 2 under their topic-Article L1
# parent. See task_rules/level_rubric.md depth chain.
(2, re.compile(r"^(\d+)\.(\d+)(?![.\d])")),

Suggestion: The N.M matcher has the same lookahead issue and fails on headings like 2.1. Definitions, so they also fall through to the bare N. pattern and become level 1 instead of level 2. This breaks parent-child hierarchy for agreements that punctuate subsection numbers with a trailing dot. [incorrect condition logic]

Severity Level: Major ⚠️
- ❌ Two-part subsections downgraded from level 2 to level 1.
- ⚠️ Children appear deeper than their mis-leveled parent.
- ⚠️ Consumers of level rubric misinterpret agreement structure.
Steps of Reproduction ✅
1. Run `main()` in `scripts/parse_doc2dict_with_config.py:4–26` over an EX-10 agreement
whose doc2dict section tree contains a heading like `"2.1. Definitions"` (i.e., the
section number `"2.1."` followed by a space and title text). The existence of
terminator periods immediately after section numbers such as `"Section 2.1."` is
documented in `src/clause_extract/agreement_config.py:10–13`, where a `\.?` slot was
added specifically to handle this punctuation.

2. In `parse_one()` (`scripts/parse_doc2dict_with_config.py:1–37`), `html2dict` builds
`document_tree`, and `walk_sections(document_tree, depth=0)` is invoked
(`scripts/parse_doc2dict_with_config.py:36–37`). As `walk_sections` iterates nodes
(`scripts/parse_doc2dict_with_config.py:38–75, 96–115`), it encounters the `"2.1.
Definitions"` section and calls `_remap_depth(title, cls, depth,
parent_level=parent_level)` at line 76.

3. `_remap_depth` calls `_infer_level_from_title(title or "")`
(`scripts/parse_doc2dict_with_config.py:21–23,165–175`), which tests `_LEVEL_PATTERNS` in
order (`scripts/parse_doc2dict_with_config.py:25–63`). For the title `"2.1. Definitions"`,
the two-part `N.M` matcher `(2, re.compile(r"^(\d+)\.(\d+)(?![.\d])"))` at line 155 fails
because immediately after the second digit `"1"` the next character is `"."`, which the
negative lookahead `(?![.\d])` explicitly forbids; thus the `N.M` pattern does not match
headings that punctuate the number as `"2.1."`.

4. After the `N.M` pattern fails, `_infer_level_from_title` proceeds to the bare `N.` rule
`(1, re.compile(r"^(\d+)\."))` at line 156, which matches only the `"2."` prefix of `"2.1.
Definitions"` and causes `_infer_level_from_title` to return level `1` instead of the
intended level `2`. `_remap_depth` then uses this incorrect level
(`scripts/parse_doc2dict_with_config.py:21–33, 96–103`), so the `"2.1. Definitions"`
header is emitted as a top-level body clause in the JSONL, while its lettered or numbered
children are assigned deeper levels. This breaks the intended parent-child hierarchy for
common two-part subsection headings that include a trailing period after the number.

@arthrod (Owner, Author) commented May 17, 2026

Triage agent — PR #76 comment review (read-only pass, no code changes)

4 inline comments reviewed (2 unique issues, raised by both Gemini and CodeAnt):

  1. gemini-code-assist + codeant-ai @ parse_doc2dict...py:147 — N.M.K lookahead (?![.\d]) rejects trailing dot (NEED-DISCUSSION)
    Both Gemini (HIGH) and CodeAnt (MAJOR) flag the same issue: (?![.\d]) blocks matching 2.1.1. Term Loans style headings, causing them to fall through to the bare N. rule as level 1. Proposed fix from both: change to (?!\.\d) to allow the trailing dot but still block longer numeric chains. This is a real correctness issue with a clear mechanical fix — but it touches the core _LEVEL_PATTERNS which drives every idx freeze. Arthur's judgment needed on whether to apply it immediately (the fix appears safe and non-regressing) or in a targeted polish PR after validating against all frozen idxs.

  2. gemini-code-assist + codeant-ai @ parse_doc2dict...py:155 — N.M lookahead same issue (NEED-DISCUSSION)
    Same root cause on the two-part N.M pattern. 2.1. Definitions would fall to level 1. Same fix applies. Bundled with the item above — one decision covers both.

NEED-DISCUSSION (1 logical issue, 2 affected patterns): Trailing-dot lookahead fix in _LEVEL_PATTERNS — verify against all 9 frozen idxs before applying.

Triage only — no code changes made this round.
