idx=6: freeze (69 records) — ARRIS Employment Agreement, IWW-in-body rescue + section-ref filter by arthrod · Pull Request #79 · arthrod/clause-extract

arthrod · 2026-05-17T10:52:56Z

User description

Summary

Seventh stacked PR. Adds idx=6 (EMPLOYMENT AGREEMENT between ARRIS Group, Inc. and Timothy O'Loughlin, January 4, 2016) as the seventh verified frozen baseline on top of idx=5 (PR #78).

Parser changes (2 surgical, shape-driven)

_split_iww_and_sig_from_body (new, ~lines 3414-3565) — defensive rescue for when doc2dict packs the entire sig page (IWW + party labels + /s/ + By/Name/Title) into the trailing body of the last numbered subsection. Guards: returns rows unchanged if any record already starts with IN WITNESS WHEREOF as a standalone span. When the guard passes and a body contains IWW mid-text, splits into prefix + IWW (L1) + per-line sig records (L2 marked _sig_line=True so the existing _consolidate_sig_lines_after_iww anchors them).

For idx=6 this is a no-op — doc2dict naturally gave 12 separate leaf nodes (1 IWW + 11 sig fragments). The function is a corpus-wide safety net for the OTHER pattern where doc2dict packs everything into one body. All 6 prior idxs verified untouched by the function.
Section-reference filter in _split_inline_section_markers (line ~2385): cross-references like "this Section 5." or "Sections 7, 8 and 9 hereof" were being picked up by the marker scanner as candidate section starts, breaking the gap-1 monotonic check. Added a 15-char lookbehind dropping any candidate preceded by \bSections?\s*[\xa0\s]*$. For idx=6 this reduced [5, 11, 12] → [11, 12] (clean monotonic), allowing Sections 11 and 12 to emit as their own L1 records.

Verified output for idx=6

69 records, distribution {L0:1, L1:14, L2:45, L3:9}
Reconstruction: word_coverage 99.4%, char_ratio 99.7%
Max depth: 3

Top-level structure

o=0   L0: EMPLOYMENT AGREEMENT
o=1   L1: THIS EMPLOYMENT AGREEMENT (the "Agreement"), dated as of 4 January, 2016 is by and between ARRIS GROUP, INC...
o=2   L1: 1. Employment
o=3-8 L2: (a)-(f) employment subsections
o=9   L1: 2. Compensation.
... (Sections 1-12 each at L1)
o=50  L1: 11. Arbitration—Exclusive Remedy           ← was buried before fix #2
o=51  L1: 12. Miscellaneous                          ← was buried before fix #2
o=57  L1: IN WITNESS WHEREOF, the parties have executed this Agreement as of the date first above written.
o=58-68 L2: 11 sig-page fragments (doc2dict natural grouping)

Sig area (doc2dict's natural per-line fragmentation, preserved per rubric)

o=58 L2: "Company"
o=59 L2: ARRIS GROUP, INC.
o=60 L2: By:
o=61 L2: /s/ R.J. Stanzione
o=62 L2: Name:
o=63 L2: R.J. Stanzione
o=64 L2: Title:
o=65 L2: Chairman & CEO
o=66 L2: "Executive"
o=67 L2: /s/ Timothy O'Loughlin
o=68 L2: Timothy O'Loughlin

The 11-record sig split is doc2dict's natural HTML grouping (each line was a separate <p> element in the source). Per the rubric's "preserve doc2dict natural grouping at depth 2" rule, the parser preserves whatever doc2dict provides — does not impose per-line splitting and does not merge across parties. Inspector verified each of these 11 records corresponds to a separate cls='promoted text leaf' node in the raw parquet (node_ids 60-70).

Test plan

uv run scripts/parse_doc2dict_with_config.py --limit 7 --no-truncate --output-dir data/auto_parse exits 0 with ok 7
uv run scripts/level_loop/freeze.py 6 --force reports word_coverage ≥ 90% (99.4%)
uv run scripts/level_loop/regress.py reports all 7 frozen idxs OK
Inspector verified both fixes + independently confirmed sig split is doc2dict-driven, not parser-imposed
All 6 prior idxs (0-5) byte-identical regress after parser changes

Source

http://www.sec.gov/Archives/edgar/data/1645494/000119312517164198/d342430dex101.htm

Why this matters for the corpus

_split_iww_and_sig_from_body adds a defensive layer for the cross-corpus pattern where doc2dict packs the sig page into the last subsection's body instead of emitting it as separate nodes. Guarded by IWW-already-present check so it cannot regress agreements where the sig page is already structurally separate.
Section-reference filter prevents inline cross-references from being mistaken for section starts. Generalizes to any agreement that cites prior sections in body prose.

🤖 Generated with Claude Code

CodeAnt-AI Description

Parse agreements with embedded section headings and signature pages correctly

What Changed

Inline numbered sections that are mentioned as references, such as “Section 5” or “Sections 7, 8 and 9,” are no longer mistaken for new section headings
When the signature page is buried inside the last section’s text, it is now split out so the “IN WITNESS WHEREOF” clause and each signature line appear in the correct order
The ARRIS employment agreement is now frozen with the corrected section structure and signature block layout

Impact

✅ Fewer missing section splits
✅ Correct signature-page ordering
✅ More complete agreement extractions

🔄 Retrigger CodeAnt AI Review

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

…ig page split + Section-ref filter for inline N. markers idx=6 is a Form 10 EX-10.1 Employment Agreement (ARRIS Group / Timothy O'Loughlin, 2016). doc2dict ships two structural pathologies in this filing: 1. The entire signature page (IWW operating clause + party labels + /s/ marks + By:/Name:/Title: fields) is packed into the trailing body text of "(j) Waiver of Breach" — the last lettered subsection of Section 12 (Miscellaneous). Result: zero L1 IWW record, zero L2 sig-line records. 2. Sections 11 (Arbitration—Exclusive Remedy) and 12 (Miscellaneous, with subsections (a)-(e) inline) are stuffed inside the preamble paragraph's body. Result: 12 visible top-level sections in source but only 10 L1 records in JSONL, with sections 11/12 buried. Both fixes are structural — no phrase blocklists. Change 1: new function _split_iww_and_sig_from_body (runs before _explode_signature_block_lines). Guard: returns rows unchanged if the document already has any record whose title/body STARTS with the IWW phrase (this is the case for all six prior idxs, so they are untouched). When the guard passes and a body contains IWW mid-text, the body is split into prefix + IWW operating clause + sig-page lines, the IWW becomes an L1 record, and each unique sig-page line becomes an L2 record marked `_sig_line=True` so `_consolidate_sig_lines_after_iww` anchors them after the IWW. Global per-block dedup collapses the doc2dict pathology where party labels, /s/ marks, and names are emitted twice. Change 2: in _split_inline_section_markers, filter out raw-marker candidates whose preceding 15 chars match `\\bSections?\\s*[\\xa0\\s]*$`. This rejects textual cross-references like "this Section 5." or "Sections 7, 8 and 9 hereof" that the marker regex otherwise picks up as section-start hits. For idx=6 nid=3 (the preamble), this drops the false-positive `5` and reduces the marker sequence to [11, 12] — a clean gap-1 monotonic run that triggers the existing split path, emitting Section 11 (Arbitration) and Section 12 (Miscellaneous) as their own L1 records. Bytewise-confirmed: idx=0..5 frozen output is unchanged. idx=6 freeze passes all gates (1 L0, max level 3, monotonic order, 99% word coverage, no boilerplate). Full regress (all 7 idxs) OK. Final shape: 69 records, levels {0: 1, 1: 14, 2: 45, 3: 9}. The 14 L1 records cover the preamble + Sections 1-12 + the IWW operating clause. The 45 L2 records cover lettered subsections and the per-line sig block. The 9 L3 records cover the deep roman/letter nesting under Sections 5 (General Release sub-items) and 6 (Good Reason sub-clauses). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

blocksorg · 2026-05-17T10:52:59Z

Mention Blocks like a regular teammate with your question or request:

@blocks review this pull request
@blocks make the following changes ...
@blocks create an issue from what was mentioned in the following comment ...
@blocks explain the following code ...
@blocks are there any security or performance concerns?

Run @blocks /help for more information.

Workspace settings | Disable this message

codeant-ai · 2026-05-17T10:53:00Z

CodeAnt AI is reviewing your PR.

qodo-code-review · 2026-05-17T10:53:00Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

sourcery-ai

Hi @arthrod! 👋

Your private repo does not have access to Sourcery.

Please upgrade to continue using Sourcery ✨

coderabbitai · 2026-05-17T10:53:07Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added employment agreement dataset with 69 records including employment terms, severance provisions, change-in-control agreements, confidentiality clauses, and signature pages.
Improvements
- Enhanced section marker detection for improved document parsing accuracy.
- Improved extraction and separation of witness clauses and signature content from contract bodies.

Walkthrough

This PR adds a frozen employment-agreement dataset (idx_6) with 69 hierarchical text spans and strengthens the document parser. It refines inline section detection to avoid false splits on textual cross-references, introduces IWW/signature extraction to normalize misplaced structural content, and integrates both enhancements into the parsing pipeline.

Changes

Data, State, and Parser Enhancements

Layer / File(s)	Summary
Frozen idx_6 dataset and state updates `data/auto_parse/level_freeze/frozen/idx_6.jsonl`, `data/auto_parse/level_freeze/state.json`	Creates the full employment-agreement JSONL dataset for idx_6 (69 records with hierarchical levels and spans covering employment terms, severance, non-compete, arbitration, and signature content), marks idx_6 as frozen in the indices list, and appends freeze-event entries to the state history tracking the dataset at 55, 67, and 69 record milestones.
Inline section marker refinement `scripts/parse_doc2dict_with_config.py`	Refines `_split_inline_section_markers` with a two-stage context check: collects raw numbered markers ("N."), then filters out matches whose preceding tail text appears to be "Section(s)" cross-references rather than fresh section headings, preventing erroneous splits on in-text references.
IWW/signature extraction and pipeline wiring `scripts/parse_doc2dict_with_config.py`	Introduces `_split_iww_and_sig_from_body` to detect and extract "IN WITNESS WHEREOF" clauses embedded mid-body (guarded to skip when already correctly separated), splitting affected records into a kept prefix, an L1 IWW record, and deduplicated L2 signature-line records; wires the new pass into `parse_one` immediately before signature-line consolidation to ensure proper hierarchy.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

arthrod/clause-extract#5: Both PRs update the same data/auto_parse/level_freeze workflow by adding a new frozen/idx_*.jsonl file and appending corresponding state.json history/current_idx freeze+advance entries (while only the main PR also changes scripts/parse_doc2dict_with_config.py).

Suggested labels

Feat2

Poem

🐰 A new dataset hops into the freeze,
With employment terms and severance decrees,
The parser now spots sections with care,
And gathers lost signatures from despair,
One hop closer to documents so clear! 📜✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Linked Issues check	⚠️ Warning	The PR objectives mention linked issue `#2` but the changeset does not implement the three required fixes: dead code removal, brittle pattern replacement with regex constant, and ARTICLE anchor adjustment.	The PR must implement all three fixes from issue `#2`: remove _collect_table_text and _promote_text_blob, replace 'WITNESS' substring check with _IN_WITNESS_WHEREOF_RE constant, and drop $ anchor from ARTICLE pattern.
Out of Scope Changes check	⚠️ Warning	The changes to scripts/parse_doc2dict_with_config.py introduce two new functions (_split_iww_and_sig_from_body and section-ref filter) not mentioned in the linked issue `#2` requirements.	Clarify whether _split_iww_and_sig_from_body and section-ref filter are part of issue `#2` scope, or if they belong to a separate issue for idx=6 frozen baseline addition.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main addition (idx=6 freeze with 69 records) and highlights the two key parser fixes (IWW-in-body rescue and section-ref filter).
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, detailing the new frozen index, parser changes, verified output metrics, and testing performed.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request improves document parsing by correctly identifying section cross-references and extracting signature blocks that were previously merged into the final section's body. Feedback highlights a critical issue in the signature line deduplication logic, which would incorrectly drop standard labels for subsequent signers in multi-signer agreements. Additionally, it is recommended to move regex compilations to the module level to improve performance and maintain consistency across the parser.

gemini-code-assist · 2026-05-17T10:55:26Z

+        seen_lines: set[str] = set()
+        for line in lines:
+            if line in seen_lines:
+                continue  # collapse duplicate
+            seen_lines.add(line)


The global deduplication of signature lines using seen_lines is problematic. While it may address a specific doc2dict pathology where blocks are repeated, it will incorrectly drop common repeated field labels like "By:", "Name:", or "Title:" in multi-signer agreements if they appear as standalone lines (which is common in doc2dict output). This will result in missing labels for subsequent signers. Consider removing the deduplication or making it more selective (e.g., only dedupe if the entire sequence of lines is repeated).

gemini-code-assist · 2026-05-17T10:55:26Z

+        # enough for "Sections " variants without being so wide that it
+        # picks up unrelated mentions.
+        markers = []
+        _SECTION_REF_RE = re.compile(r"(?i)\bSections?\s*[\xa0\s]*$")


The regular expression _SECTION_REF_RE is compiled inside the loop for every record that contains potential markers. This is inefficient, especially for large documents. It should be moved to the module level or at least outside the rows loop to improve performance.

gemini-code-assist · 2026-05-17T10:55:26Z

+        if _is_iww_clause(title) or _is_iww_clause(body):
+            return rows
+
+    _IWW_INLINE_RE = re.compile(r"(?i)IN\s+WITNESS\s+WHEREOF")


The regular expression _IWW_INLINE_RE is compiled inside the function _split_iww_and_sig_from_body. For consistency with other parser functions and better performance, it should be moved to the module level.

codeant-ai · 2026-05-17T11:00:00Z

+        _SECTION_REF_RE = re.compile(r"(?i)\bSections?\s*[\xa0\s]*$")
+        for m in raw_markers:
+            pre = body[max(0, m.start() - 15):m.start()]
+            if _SECTION_REF_RE.search(pre):
+                continue
+            markers.append(m)


Suggestion: The new section-reference filter drops any marker preceded by Section/Sections, which also matches legitimate inline headings like Section 11. Notices. That causes real top-level sections embedded in body text to be skipped entirely and prevents the split pass from emitting them as separate records. Narrow this condition so it only excludes true cross-references (for example, require preceding words like this/under/pursuant to or punctuation patterns typical of prose references) instead of filtering all Section N. prefixes. [incorrect condition logic]

Severity Level: Critical 🚨

- ❌ Inline numbered sections never split into separate records. - ⚠️ Section 11/12 clauses remain buried inside prior body. - ⚠️ Downstream clause extractors miss affected L1 sections.

Steps of Reproduction ✅

1. In a test or REPL, construct a `rows` list with a single agreement record shaped like existing pipeline records (see `parse_one` at `scripts/parse_doc2dict_with_config.py:716-869`): `{"node_id": 1, "parent_node_id": 0, "scope": "agreement", "is_envelope": False, "body_direct": "… Section 11. Notices. … Section 12. Amendments. …", "body_direct_chars": len(body)}` where the body is >200 chars and both headings are embedded inline. 2. Call `_split_inline_section_markers(rows)` defined at `scripts/parse_doc2dict_with_config.py:2385-152` on this synthetic `rows` list; this function is also invoked from the real pipeline in `parse_one` at line `4099` (`sections = _split_inline_section_markers(sections)`), so this mirrors actual execution. 3. Inside `_split_inline_section_markers`, the regex at line `2419` (`raw_markers = list(re.finditer(r"(?:(?<=[.\s\xa0])|^)(\d+)\.\s*[A-Z]", body))`) finds inline markers for `11.` and `12.`; for each, the 15‑character lookbehind slice `pre = body[max(0, m.start() - 15):m.start()]` at line `2433` contains `"Section "` immediately before the digits. 4. The section-reference filter at lines `2431-2435` sees that `_SECTION_REF_RE = re.compile(r"(?i)\bSections?\s*[\xa0\s]*$")` matches this `pre` substring and so `continue`s, dropping both markers. Because `markers` stays empty, the function never splits the body into separate records, so the inline top-level headings for Sections 11 and 12 are never promoted to their own L1 records when `parse_one` runs over such a document.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖

This is a comment left during a code review. **Path:** scripts/parse_doc2dict_with_config.py **Line:** 2431:2436 **Comment:** *Incorrect Condition Logic: The new section-reference filter drops any marker preceded by `Section`/`Sections`, which also matches legitimate inline headings like `Section 11. Notices`. That causes real top-level sections embedded in body text to be skipped entirely and prevents the split pass from emitting them as separate records. Narrow this condition so it only excludes true cross-references (for example, require preceding words like `this`/`under`/`pursuant to` or punctuation patterns typical of prose references) instead of filtering all `Section N.` prefixes. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix

👍 | 👎

codeant-ai · 2026-05-17T11:00:01Z

+        end_period = iww_and_after.find(".", m.end() - m.start())
+        if end_period == -1:
+            # No period — treat whole text as IWW (no sig lines parsed
+            # separately). Skip; better to leave the body intact than
+            # emit a malformed record.
+            continue
+        iww_sentence = iww_and_after[:end_period + 1].strip()
+        sig_rest = iww_and_after[end_period + 1:]


Suggestion: The IWW split uses the first . after IN WITNESS WHEREOF as the sentence terminator, which is unsafe for legal text containing abbreviations (for example Inc. or month abbreviations) before the actual clause end. This truncates the IWW clause and incorrectly pushes the remainder into signature-line parsing. Use a more robust boundary (e.g., end-of-line clause boundary or a stricter terminator pattern) so abbreviations do not prematurely end the clause. [logic error]

Severity Level: Critical 🚨

- ❌ IWW clause truncated at abbreviations like "Inc.". - ⚠️ Remaining IWW sentence mis-labeled as signature lines. - ⚠️ Signature-page hierarchy inaccurate for affected agreements.

Steps of Reproduction ✅

1. Construct a synthetic `rows` list representing a late-stage agreement record as produced in `parse_one` (see `parse_one` at `scripts/parse_doc2dict_with_config.py:716-869`): include a record `r` with `scope="agreement"`, `is_envelope=False`, and `body_direct` containing inline IWW + sig text such as `"… (j) Waiver. IN WITNESS WHEREOF, ULURU Inc. has caused this Agreement to be executed as of the date first above written.\nBy: /s/ …"`, ensuring no other record's title/body starts with `IN WITNESS WHEREOF`. 2. Call `_split_iww_and_sig_from_body(rows)` defined at `scripts/parse_doc2dict_with_config.py:115-266`; in the real pipeline this is invoked from `parse_one` at line `4127` (`sections = _split_iww_and_sig_from_body(sections)`), so this call mirrors how the parser runs on actual filings. 3. In `_split_iww_and_sig_from_body`, the guard loop at lines `153-162` finds no existing IWW-leading record and returns the original `rows` unchanged up to that point, then `_IWW_INLINE_RE = re.compile(r"(?i)IN\s+WITNESS\s+WHEREOF")` at line `164` finds the inline IWW phrase inside `r["body_direct"]` (match `m`). 4. For this body, `iww_and_after = body[m.start():]` and `end_period = iww_and_after.find(".", m.end() - m.start())` at line `3488` returns the position of the period in the abbreviation `"Inc."` (the first `.` after the phrase), so `iww_sentence = iww_and_after[:end_period + 1]` at line `3494` becomes `"IN WITNESS WHEREOF, ULURU Inc."` and the remainder `" has caused this Agreement to be executed …"` is assigned to `sig_rest` and fed into `_split_sig_block_body_into_lines(sig_rest)` at line `240`. As a result, the main IWW operating clause text after the corporate abbreviation is misclassified as signature-line content rather than part of the IWW L1 record when `parse_one` processes such a document.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖

This is a comment left during a code review. **Path:** scripts/parse_doc2dict_with_config.py **Line:** 3488:3495 **Comment:** *Logic Error: The IWW split uses the first `.` after `IN WITNESS WHEREOF` as the sentence terminator, which is unsafe for legal text containing abbreviations (for example `Inc.` or month abbreviations) before the actual clause end. This truncates the IWW clause and incorrectly pushes the remainder into signature-line parsing. Use a more robust boundary (e.g., end-of-line clause boundary or a stricter terminator pattern) so abbreviations do not prematurely end the clause. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix

👍 | 👎

codeant-ai · 2026-05-17T11:00:01Z

+        seen_lines: set[str] = set()
+        for line in lines:
+            if line in seen_lines:
+                continue  # collapse duplicate
+            seen_lines.add(line)


Suggestion: Global deduplication of signature lines removes repeated lines anywhere in the block, but repeated values are valid across multiple signers (for example identical By:/Title: lines). This causes data loss by dropping legitimate signer rows. Restrict deduplication to clearly duplicated adjacent lines from parser duplication artifacts, or keep all lines and let downstream grouping handle repeats. [logic error]

Severity Level: Major ⚠️

- ⚠️ Repeated signature titles dropped for later signers. - ⚠️ Multi-signer blocks lose structurally valid duplicated lines. - ⚠️ Downstream sig analysis sees incomplete signer attributes.

Steps of Reproduction ✅

1. Use the same `_split_iww_and_sig_from_body` pipeline from `scripts/parse_doc2dict_with_config.py:115-266` (called from `parse_one` at line `4127`), but construct a `rows` record where the inline sig-page tail `sig_rest` includes two different signers whose lines share identical text, e.g. `"By:\nTitle: Director\nBy:\nTitle: Director\n"` so that both signers have the same `Title:` line. 2. When `_split_iww_and_sig_from_body` reaches lines `240-246`, it calls `_split_sig_block_body_into_lines(sig_rest)` at line `240`, which returns `lines` as `["By:", "Title: Director", "By:", "Title: Director"]` after trimming whitespace and dropping footer chrome. 3. The deduplication loop at lines `3540-3544` initializes `seen_lines: set[str] = set()` and then iterates `for line in lines:`; the first `"Title: Director"` is added to `seen_lines`, but when the second signer's `"Title: Director"` is encountered, the `if line in seen_lines: continue` branch is taken, so that duplicate line is skipped and never emitted as a separate sig-line record. 4. Downstream, `_explode_signature_block_lines` at `scripts/parse_doc2dict_with_config.py:269-555` and `_merge_body_only_sig_fragments_in_order` at `558-615` operate only on the deduplicated set; the second signer's title line has been lost permanently, so any consumers of `parse_one`'s `sections` output (e.g. the Parquet written in `main` at `872-895`) see incomplete signature information when there are structurally valid repeated lines across signers.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖

This is a comment left during a code review. **Path:** scripts/parse_doc2dict_with_config.py **Line:** 3540:3544 **Comment:** *Logic Error: Global deduplication of signature lines removes repeated lines anywhere in the block, but repeated values are valid across multiple signers (for example identical `By:`/`Title:` lines). This causes data loss by dropping legitimate signer rows. Restrict deduplication to clearly duplicated adjacent lines from parser duplication artifacts, or keep all lines and let downstream grouping handle repeats. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix

👍 | 👎

codeant-ai · 2026-05-17T11:00:06Z

CodeAnt AI finished reviewing your PR.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/parse_doc2dict_with_config.py`:
- Around line 4120-4127: Add a unit/freeze fixture that exercises the
IWW-in-body rescue by constructing a parser input where a body node contains the
"IN WITNESS WHEREOF" phrase mid-body followed by signature lines so
_split_iww_and_sig_from_body actually performs a split; update or add a test
(targeting the parser run that produces the sections list and the post-split
structure) to assert that an IWW L1 node is emitted and that each signature line
becomes L2 children in the expected order and parentage, ensuring the case is
not a no-op (idx=6) and fails until the split code path runs.
- Around line 3452-3465: There are three separate IN WITNESS WHEREOF matchers
(_IWW_INLINE_RE in this function, the _is_iww_clause() helper, and logic in
_find_signature_cutoff); consolidate them by defining a single module-level
compiled regex (e.g., IWW_RE) and update all call sites to use it: replace the
local _IWW_INLINE_RE with the module-level name in
parse_doc2dict_with_config.py, refactor _is_iww_clause() to consult IWW_RE, and
change the fallback/cutoff logic inside _find_signature_cutoff to use IWW_RE as
well so the phrase matching is compiled once and consistent across
_is_iww_clause, _find_signature_cutoff, and the loop that checks rows.
- Around line 3501-3525: The synthetic IWW node is being attached to
r.get("parent_node_id") (the contaminated subsection) which yields an
inconsistent tree; instead set iww_record["parent_node_id"] to the nearest
ancestor at the current subdoc container level (i.e., the node whose depth ==
l1_depth - 1). To fix: before creating iww_record, compute container_root_id by
walking the parent chain (using existing row parent_node_id values in r and/or
new_rows) until you find a node with depth == l1_depth - 1, then assign
parent_node_id = container_root_id rather than r.get("parent_node_id"); keep
other fields (subdoc_penalty, l1_depth, iww_node_id, next_id) unchanged.
- Around line 3539-3544: The current loop over lines uses seen_lines to globally
dedupe duplicates which removes non-consecutive repeats; change it to only
collapse adjacent duplicate signature lines by replacing the seen_lines set
logic with a previous-line check: iterate through lines produced by
_split_sig_block_body_into_lines(sig_rest), keep a variable like prev_line
(initially None), skip the current line only if line == prev_line (to collapse
consecutive duplicates), otherwise emit/add the line and set prev_line = line;
update the code paths around variables lines, line and seen_lines accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 724a85b1-5c7d-4891-a2f1-d73263ddf14c

📥 Commits

Reviewing files that changed from the base of the PR and between a6e0d87 and 79dcef3.

📒 Files selected for processing (3)

data/auto_parse/level_freeze/frozen/idx_6.jsonl
data/auto_parse/level_freeze/state.json
scripts/parse_doc2dict_with_config.py

📜 Review details

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (Custom checks)

**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run <cli> --help, assert exit code 0. Fail if smoke test fails.
Run uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run uv run ruff check . --diff for Python linting. Fail if exit code is non-zero and list each violation.
Run uv run ruff format --check --diff . for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run uv run ruff check --select I,F401 . to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: uv run pytest --tb=line -q on origin/main to capture baseline pass/fail counts, and uv run pytest --tb=short -q on PR branch. Fail immediately if exit code is non-zero.
Run uv run typy check for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare type: ignore comments (without error codes) in Python files and cast() calls without explanatory comments. Warn for each. Fail if bare type: ignore count > 3.

Files:

scripts/parse_doc2dict_with_config.py

**/*.{py,ts,tsx}

📄 CodeRabbit inference engine (Custom checks)

For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.

Files:

scripts/parse_doc2dict_with_config.py

🔍 Remote MCP

Summary of Relevant Context Found

Based on my research across available tools, here's the additional context relevant to this PR review:

Document Source & Context

The ARRIS Employment Agreement is between ARRIS Group, Inc. and Timothy O'Loughlin (President, North American Sales). O'Loughlin spent 18 years as President of the Americas Business Unit and ran Americas Sales and global marketing for ARRIS prior to its sale to CommScope, establishing the executive-level nature of this employment agreement.

"IN WITNESS WHEREOF" Clause Context

The PR's defensive handling of IWW clauses is contextually important. "IN WITNESS WHEREOF" serves as the formal conclusion to a contract or legal document, indicating that the parties have agreed to the terms, and typically precedes the signature blocks where each party signs and dates the document. "In witness whereof" is an old legal phrase signaling the signing section of a contract—not legally required, but still widely used, and appears in formal closing phrases in contracts, deeds, and other legal documents that signal the end of the agreement's terms and introduces the signature block. This underscores why the _split_iww_and_sig_from_body function acts as a corpus-wide safety net when documents have embedded IWW content within body text rather than as a standalone structural element.

Section Detection & Regex Patterns

The PR's refinement to filter out textual cross-references (e.g., "this Section 5") aligns with broader document parsing challenges. Document parsers handle numerous formatting variations by trying multiple regex patterns and falling back gracefully. Parsers identify unique identifiers or paragraph markers, so patterns like "Part I: Blah Blah" become coded section references, and lettered subsections follow hierarchical patterns. The 15-character lookbehind check in _split_inline_section_markers prevents false positives from inline references—a pragmatic solution to a common parsing problem.

Verification Status

The PR demonstrates comprehensive regression testing: the output shows 69 records with documented hierarchical structure (L0:1, L1:14, L2:45, L3:9), word coverage at 99.4%, character ratio at 99.7%, and all prior frozen indices (0–5) remain byte-identical after parser changes. This validates both the new preprocessing passes and the section-reference filter changes.

Note on User Requirements: The user requirements mention "Stitch" tools for design recommendations; however, these tools are not available in the current toolset. This PR is a code/parser review, not a design task, so those requirements do not apply.

🔇 Additional comments (4)

data/auto_parse/level_freeze/frozen/idx_6.jsonl (1)

1-69: LGTM!

data/auto_parse/level_freeze/state.json (2)

9-10: LGTM!

143-166: LGTM!

scripts/parse_doc2dict_with_config.py (1)

2419-2436: LGTM!

coderabbitai · 2026-05-17T11:00:08Z

+    # Guard: only act when no existing record carries IWW as leading text.
+    # Without this, we'd double-split documents that already have a clean
+    # IWW record (the prior idxs) and break their frozen baselines.
+    for r in rows:
+        if r.get("is_envelope") or r.get("scope") == "trailer":
+            continue
+        title = (r.get("title") or "").strip()
+        body = (r.get("body_direct") or "").strip()
+        if _is_iww_clause(title) or _is_iww_clause(body):
+            return rows
+
+    _IWW_INLINE_RE = re.compile(r"(?i)IN\s+WITNESS\s+WHEREOF")
+
+    next_id = max((r["node_id"] for r in rows), default=-1) + 1


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Centralize the IWW matcher instead of adding another local regex.

This path introduces a third definition of “IN WITNESS WHEREOF” matching (_IWW_INLINE_RE here, _is_iww_clause(), and the separate fallback logic in _find_signature_cutoff). That leaves the rescue, cutoff detection, and L1 pinning free to drift again. Please promote this phrase to one module-level compiled regex and route all call sites through it.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/parse_doc2dict_with_config.py` around lines 3452 - 3465, There are three separate IN WITNESS WHEREOF matchers (_IWW_INLINE_RE in this function, the _is_iww_clause() helper, and logic in _find_signature_cutoff); consolidate them by defining a single module-level compiled regex (e.g., IWW_RE) and update all call sites to use it: replace the local _IWW_INLINE_RE with the module-level name in parse_doc2dict_with_config.py, refactor _is_iww_clause() to consult IWW_RE, and change the fallback/cutoff logic inside _find_signature_cutoff to use IWW_RE as well so the phrase matching is compiled once and consistent across _is_iww_clause, _find_signature_cutoff, and the loop that checks rows.

coderabbitai · 2026-05-17T11:00:08Z

+        subdoc_penalty = r.get("subdoc_penalty") or 0
+        l1_depth = 1 + subdoc_penalty
+        l2_depth = 2 + subdoc_penalty
+
+        # Emit the IWW operating clause as L1.
+        iww_record = {
+            "node_id": next_id,
+            "parent_node_id": r.get("parent_node_id"),
+            "depth": l1_depth,
+            "doc2dict_section_key": f"_iww_{next_id}",
+            "cls": "promoted text leaf",
+            "title": "",
+            "standardized_title": None,
+            "n_direct_section_children": 0,
+            "body_direct_chars": len(iww_sentence),
+            "body_recursive_chars": len(iww_sentence),
+            "body_direct": iww_sentence,
+            "body_recursive": iww_sentence,
+            "subdoc_penalty": subdoc_penalty,
+            "is_envelope": False,
+            "scope": "agreement",
+        }
+        new_rows.append(iww_record)
+        next_id += 1
+        iww_node_id = iww_record["node_id"]


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reparent the synthetic IWW node to the agreement/subdoc root.

parent_node_id is copied from the contaminated subsection, so when this rescue fires the new depth-1 IWW row becomes a child of a deeper clause. That leaves the tree internally inconsistent and gives every emitted sig-line the wrong lineage in parquet/downstream tree-based passes. The parent here should be the nearest ancestor at the current subdoc_penalty container level, not r.get("parent_node_id").

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/parse_doc2dict_with_config.py` around lines 3501 - 3525, The synthetic IWW node is being attached to r.get("parent_node_id") (the contaminated subsection) which yields an inconsistent tree; instead set iww_record["parent_node_id"] to the nearest ancestor at the current subdoc container level (i.e., the node whose depth == l1_depth - 1). To fix: before creating iww_record, compute container_root_id by walking the parent chain (using existing row parent_node_id values in r and/or new_rows) until you find a node with depth == l1_depth - 1, then assign parent_node_id = container_root_id rather than r.get("parent_node_id"); keep other fields (subdoc_penalty, l1_depth, iww_node_id, next_id) unchanged.

coderabbitai · 2026-05-17T11:00:08Z

+        lines = _split_sig_block_body_into_lines(sig_rest)
+        seen_lines: set[str] = set()
+        for line in lines:
+            if line in seen_lines:
+                continue  # collapse duplicate
+            seen_lines.add(line)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only collapse adjacent duplicate signature lines.

The docstring says “duplicate consecutive lines are collapsed”, but the seen_lines set removes any repeated line anywhere in the block. That will erase legitimate later lines when two signers share the same company label or field text. Collapse only back-to-back duplicates here.

Suggested fix

- seen_lines: set[str] = set() + prev_line: str | None = None for line in lines: - if line in seen_lines: + if line == prev_line: continue # collapse duplicate - seen_lines.add(line) + prev_line = line new_rows.append({

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

lines = _split_sig_block_body_into_lines(sig_rest)

seen_lines: set[str] = set()

for line in lines:

if line in seen_lines:

continue # collapse duplicate

seen_lines.add(line)

lines = _split_sig_block_body_into_lines(sig_rest)

prev_line: str | None = None

for line in lines:

if line == prev_line:

continue # collapse duplicate

prev_line = line

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/parse_doc2dict_with_config.py` around lines 3539 - 3544, The current loop over lines uses seen_lines to globally dedupe duplicates which removes non-consecutive repeats; change it to only collapse adjacent duplicate signature lines by replacing the seen_lines set logic with a previous-line check: iterate through lines produced by _split_sig_block_body_into_lines(sig_rest), keep a variable like prev_line (initially None), skip the current line only if line == prev_line (to collapse consecutive duplicates), otherwise emit/add the line and set prev_line = line; update the code paths around variables lines, line and seen_lines accordingly.

coderabbitai · 2026-05-17T11:00:08Z

+    # Split inline "IN WITNESS WHEREOF" + signature page out of a body.
+    # doc2dict sometimes packs the entire sig page into the trailing body
+    # of the LAST lettered subsection when the HTML walker couldn't
+    # promote it. This pass detects that pathology STRUCTURALLY (no
+    # existing IWW record in the document, IWW phrase appearing mid-body)
+    # and emits IWW as L1 + each sig-page line as L2 — restoring the
+    # title-as-root sig-page hierarchy.
+    sections = _split_iww_and_sig_from_body(sections)


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add a fixture that actually exercises the IWW-in-body rescue.

Per the PR summary, idx=6 is a no-op here, so this new safety net can merge without a regression that proves the split path, parentage, and sig-line ordering. Please add or point to a targeted parser/freeze case where a body node contains mid-body IWW plus signature lines.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/parse_doc2dict_with_config.py` around lines 4120 - 4127, Add a unit/freeze fixture that exercises the IWW-in-body rescue by constructing a parser input where a body node contains the "IN WITNESS WHEREOF" phrase mid-body followed by signature lines so _split_iww_and_sig_from_body actually performs a split; update or add a test (targeting the parser run that produces the sections list and the post-split structure) to assert that an IWW L1 node is emitted and that each signature line becomes L2 children in the expected order and parentage, ensuring the case is not a no-op (idx=6) and fails until the split code path runs.

arthrod · 2026-05-17T11:55:20Z

Triage agent — PR #79 comment review (read-only pass, no code changes)

8 inline comments reviewed:

gemini-code-assist @ parse_doc2dict...py:3544 — global seen_lines dedup drops valid repeated sig fields (WILL-DEFER)
Critical observation shared by both Gemini and CodeRabbit: seen_lines set removes any line appearing anywhere in the block, but By:, Name:, Title: are legitimately repeated per signer. Should restrict to adjacent-duplicate collapsing. Deferred (requires fixture with multi-signer block to validate fix).
gemini-code-assist @ parse_doc2dict...py:2431 — _SECTION_REF_RE compiled inside loop (WILL-DEFER)
Performance issue: compiling a regex in a per-record loop. Should be module-level. Low urgency — Python caches compiled regexes via re.compile but explicit module-level is cleaner. Deferred to a performance/cleanup pass.
gemini-code-assist @ parse_doc2dict...py:3463 — _IWW_INLINE_RE compiled inside function (WILL-DEFER)
Same as above — module-level regex placement. Deferred.
codeant-ai @ parse_doc2dict...py:2436 — section-ref filter also drops valid inline headings like Section 11. Notices (WILL-DEFER)
Real edge case: filtering on Section/Sections prefix catches cross-references but would also skip legitimate heading lines starting with "Section N." Needs a tighter pattern (e.g., check for verb/preposition context). Deferred.
codeant-ai @ parse_doc2dict...py:3495 — IWW split uses first . as terminator, unsafe for abbreviations (WILL-DEFER)
Inc. or month abbreviations before the actual clause end would truncate the IWW clause. A more robust sentence splitter (e.g., lookahead for space + capital after .) would help. Deferred.
codeant-ai @ parse_doc2dict...py:3544 — global dedup drops legitimate repeated lines (WILL-DEFER)
Same as Gemini item 1. Both agree. Consolidated.
coderabbitai @ parse_doc2dict...py:3465 — centralize IWW matcher, 3 definitions now (WILL-DEFER)
Refactor: _IWW_INLINE_RE, _is_iww_clause(), and fallback in _find_signature_cutoff are three separate IWW detectors. Centralize into one. Deferred.
coderabbitai @ parse_doc2dict...py:3525 — synthetic IWW node copies parent_node_id from contaminated subsection (WILL-DEFER)
The new depth-1 IWW row inherits the wrong parent. Should be reparented to the agreement/subdoc root. Real tree-integrity issue but no active failing idx. Deferred.
coderabbitai @ parse_doc2dict...py:4127 — no fixture for IWW-in-body rescue path (WILL-DEFER)
idx=6 is a no-op, so the rescue path has no test coverage. Add a fixture or doc that exercises the split path. Deferred.

WILL-DEFER items (8 distinct): global sig-line dedup → adjacent-only; 2x module-level regex hoisting; section-ref filter false positives; IWW . splitter robustness; 3x IWW matcher centralization; IWW node parent_node_id fix; IWW rescue fixture.

Triage only — no code changes made this round.

sourcery-ai Bot reviewed May 17, 2026

View reviewed changes

codeant-ai Bot added the size:L This PR changes 100-499 lines, ignoring generated files label May 17, 2026

coderabbitai Bot added the Feat2 label May 17, 2026

gemini-code-assist Bot reviewed May 17, 2026

View reviewed changes

codeant-ai Bot reviewed May 17, 2026

View reviewed changes

coderabbitai Bot requested changes May 17, 2026

View reviewed changes

arthrod mentioned this pull request May 17, 2026

idx=7: freeze (25 records) — Ingevity/WestRock Amendment No.1, multi-line title + page-footer exhibit strip #80

Open

4 tasks

Conversation

arthrod commented May 17, 2026 • edited by codeant-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Summary

Parser changes (2 surgical, shape-driven)

Verified output for idx=6

Top-level structure

Sig area (doc2dict's natural per-line fragmentation, preserved per rubric)

Test plan

Source

Why this matters for the corpus

CodeAnt-AI Description

What Changed

Impact

Checking Your Pull Request

Talking to CodeAnt AI

Example

Preserve Org Learnings with CodeAnt

Example

Retrigger review

Check Your Repository Health

Uh oh!

blocksorg Bot commented May 17, 2026

Uh oh!

codeant-ai Bot commented May 17, 2026

Uh oh!

qodo-code-review Bot commented May 17, 2026

Qodo reviews are paused for this user.

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (2 warnings)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

codeant-ai Bot commented May 17, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Summary of Relevant Context Found

Document Source & Context

"IN WITNESS WHEREOF" Clause Context

Section Detection & Regex Patterns

Verification Status

Uh oh!

coderabbitai Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 17, 2026

arthrod commented May 17, 2026 •

edited by codeant-ai Bot

Loading

coderabbitai Bot commented May 17, 2026 •

edited

Loading