idx=6: freeze (69 records) — ARRIS Employment Agreement, IWW-in-body rescue + section-ref filter#79
idx=6: freeze (69 records) — ARRIS Employment Agreement, IWW-in-body rescue + section-ref filter#79arthrod wants to merge 1 commit into
Conversation
…ig page split + Section-ref filter for inline N. markers
idx=6 is a Form 10 EX-10.1 Employment Agreement (ARRIS Group / Timothy
O'Loughlin, 2016). doc2dict ships two structural pathologies in this
filing:
1. The entire signature page (IWW operating clause + party labels +
/s/ marks + By:/Name:/Title: fields) is packed into the trailing
body text of "(j) Waiver of Breach" — the last lettered subsection
of Section 12 (Miscellaneous). Result: zero L1 IWW record, zero L2
sig-line records.
2. Sections 11 (Arbitration—Exclusive Remedy) and 12 (Miscellaneous,
with subsections (a)-(e) inline) are stuffed inside the preamble
paragraph's body. Result: 12 visible top-level sections in source
but only 10 L1 records in JSONL, with sections 11/12 buried.
Both fixes are structural — no phrase blocklists.
Change 1: new function _split_iww_and_sig_from_body (runs before
_explode_signature_block_lines). Guard: returns rows unchanged if the
document already has any record whose title/body STARTS with the IWW
phrase (this is the case for all six prior idxs, so they are
untouched). When the guard passes and a body contains IWW mid-text,
the body is split into prefix + IWW operating clause + sig-page lines,
the IWW becomes an L1 record, and each unique sig-page line becomes an
L2 record marked `_sig_line=True` so `_consolidate_sig_lines_after_iww`
anchors them after the IWW. Global per-block dedup collapses the
doc2dict pathology where party labels, /s/ marks, and names are
emitted twice.
Change 2: in _split_inline_section_markers, filter out raw-marker
candidates whose preceding 15 chars match `\\bSections?\\s*[\\xa0\\s]*$`.
This rejects textual cross-references like "this Section 5." or
"Sections 7, 8 and 9 hereof" that the marker regex otherwise picks up
as section-start hits. For idx=6 nid=3 (the preamble), this drops the
false-positive `5` and reduces the marker sequence to [11, 12] — a
clean gap-1 monotonic run that triggers the existing split path,
emitting Section 11 (Arbitration) and Section 12 (Miscellaneous) as
their own L1 records.
Bytewise-confirmed: idx=0..5 frozen output is unchanged. idx=6 freeze
passes all gates (1 L0, max level 3, monotonic order, 99% word
coverage, no boilerplate). Full regress (all 7 idxs) OK.
Final shape: 69 records, levels {0: 1, 1: 14, 2: 45, 3: 9}. The 14 L1
records cover the preamble + Sections 1-12 + the IWW operating clause.
The 45 L2 records cover lettered subsections and the per-line sig
block. The 9 L3 records cover the deep roman/letter nesting under
Sections 5 (General Release sub-items) and 6 (Good Reason
sub-clauses).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Mention Blocks like a regular teammate with your question or request: @blocks review this pull request Run |
|
CodeAnt AI is reviewing your PR. |
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR adds a frozen employment-agreement dataset (idx_6) with 69 hierarchical text spans and strengthens the document parser. It refines inline section detection to avoid false splits on textual cross-references, introduces IWW/signature extraction to normalize misplaced structural content, and integrates both enhancements into the parsing pipeline. ChangesData, State, and Parser Enhancements
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
Comment |
There was a problem hiding this comment.
Code Review
This pull request improves document parsing by correctly identifying section cross-references and extracting signature blocks that were previously merged into the final section's body. Feedback highlights a critical issue in the signature line deduplication logic, which would incorrectly drop standard labels for subsequent signers in multi-signer agreements. Additionally, it is recommended to move regex compilations to the module level to improve performance and maintain consistency across the parser.
| seen_lines: set[str] = set() | ||
| for line in lines: | ||
| if line in seen_lines: | ||
| continue # collapse duplicate | ||
| seen_lines.add(line) |
There was a problem hiding this comment.
The global deduplication of signature lines using seen_lines is problematic. While it may address a specific doc2dict pathology where blocks are repeated, it will incorrectly drop common repeated field labels like "By:", "Name:", or "Title:" in multi-signer agreements if they appear as standalone lines (which is common in doc2dict output). This will result in missing labels for subsequent signers. Consider removing the deduplication or making it more selective (e.g., only dedupe if the entire sequence of lines is repeated).
| # enough for "Sections " variants without being so wide that it | ||
| # picks up unrelated mentions. | ||
| markers = [] | ||
| _SECTION_REF_RE = re.compile(r"(?i)\bSections?\s*[\xa0\s]*$") |
There was a problem hiding this comment.
| if _is_iww_clause(title) or _is_iww_clause(body): | ||
| return rows | ||
|
|
||
| _IWW_INLINE_RE = re.compile(r"(?i)IN\s+WITNESS\s+WHEREOF") |
| _SECTION_REF_RE = re.compile(r"(?i)\bSections?\s*[\xa0\s]*$") | ||
| for m in raw_markers: | ||
| pre = body[max(0, m.start() - 15):m.start()] | ||
| if _SECTION_REF_RE.search(pre): | ||
| continue | ||
| markers.append(m) |
There was a problem hiding this comment.
Suggestion: The new section-reference filter drops any marker preceded by Section/Sections, which also matches legitimate inline headings like Section 11. Notices. That causes real top-level sections embedded in body text to be skipped entirely and prevents the split pass from emitting them as separate records. Narrow this condition so it only excludes true cross-references (for example, require preceding words like this/under/pursuant to or punctuation patterns typical of prose references) instead of filtering all Section N. prefixes. [incorrect condition logic]
Severity Level: Critical 🚨
- ❌ Inline numbered sections never split into separate records.
- ⚠️ Section 11/12 clauses remain buried inside prior body.
- ⚠️ Downstream clause extractors miss affected L1 sections.Steps of Reproduction ✅
1. In a test or REPL, construct a `rows` list with a single agreement record shaped like
existing pipeline records (see `parse_one` at
`scripts/parse_doc2dict_with_config.py:716-869`): `{"node_id": 1, "parent_node_id": 0,
"scope": "agreement", "is_envelope": False, "body_direct": "… Section 11. Notices. …
Section 12. Amendments. …", "body_direct_chars": len(body)}` where the body is >200 chars
and both headings are embedded inline.
2. Call `_split_inline_section_markers(rows)` defined at
`scripts/parse_doc2dict_with_config.py:2385-152` on this synthetic `rows` list; this
function is also invoked from the real pipeline in `parse_one` at line `4099` (`sections =
_split_inline_section_markers(sections)`), so this mirrors actual execution.
3. Inside `_split_inline_section_markers`, the regex at line `2419` (`raw_markers =
list(re.finditer(r"(?:(?<=[.\s\xa0])|^)(\d+)\.\s*[A-Z]", body))`) finds inline markers for
`11.` and `12.`; for each, the 15‑character lookbehind slice `pre = body[max(0, m.start()
- 15):m.start()]` at line `2433` contains `"Section "` immediately before the digits.
4. The section-reference filter at lines `2431-2435` sees that `_SECTION_REF_RE =
re.compile(r"(?i)\bSections?\s*[\xa0\s]*$")` matches this `pre` substring and so
`continue`s, dropping both markers. Because `markers` stays empty, the function never
splits the body into separate records, so the inline top-level headings for Sections 11
and 12 are never promoted to their own L1 records when `parse_one` runs over such a
document.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** scripts/parse_doc2dict_with_config.py
**Line:** 2431:2436
**Comment:**
*Incorrect Condition Logic: The new section-reference filter drops any marker preceded by `Section`/`Sections`, which also matches legitimate inline headings like `Section 11. Notices`. That causes real top-level sections embedded in body text to be skipped entirely and prevents the split pass from emitting them as separate records. Narrow this condition so it only excludes true cross-references (for example, require preceding words like `this`/`under`/`pursuant to` or punctuation patterns typical of prose references) instead of filtering all `Section N.` prefixes.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| end_period = iww_and_after.find(".", m.end() - m.start()) | ||
| if end_period == -1: | ||
| # No period — treat whole text as IWW (no sig lines parsed | ||
| # separately). Skip; better to leave the body intact than | ||
| # emit a malformed record. | ||
| continue | ||
| iww_sentence = iww_and_after[:end_period + 1].strip() | ||
| sig_rest = iww_and_after[end_period + 1:] |
There was a problem hiding this comment.
Suggestion: The IWW split uses the first . after IN WITNESS WHEREOF as the sentence terminator, which is unsafe for legal text containing abbreviations (for example Inc. or month abbreviations) before the actual clause end. This truncates the IWW clause and incorrectly pushes the remainder into signature-line parsing. Use a more robust boundary (e.g., end-of-line clause boundary or a stricter terminator pattern) so abbreviations do not prematurely end the clause. [logic error]
Severity Level: Critical 🚨
- ❌ IWW clause truncated at abbreviations like "Inc.".
- ⚠️ Remaining IWW sentence mis-labeled as signature lines.
- ⚠️ Signature-page hierarchy inaccurate for affected agreements.Steps of Reproduction ✅
1. Construct a synthetic `rows` list representing a late-stage agreement record as
produced in `parse_one` (see `parse_one` at
`scripts/parse_doc2dict_with_config.py:716-869`): include a record `r` with
`scope="agreement"`, `is_envelope=False`, and `body_direct` containing inline IWW + sig
text such as `"… (j) Waiver. IN WITNESS WHEREOF, ULURU Inc. has caused this Agreement to
be executed as of the date first above written.\nBy: /s/ …"`, ensuring no other record's
title/body starts with `IN WITNESS WHEREOF`.
2. Call `_split_iww_and_sig_from_body(rows)` defined at
`scripts/parse_doc2dict_with_config.py:115-266`; in the real pipeline this is invoked from
`parse_one` at line `4127` (`sections = _split_iww_and_sig_from_body(sections)`), so this
call mirrors how the parser runs on actual filings.
3. In `_split_iww_and_sig_from_body`, the guard loop at lines `153-162` finds no existing
IWW-leading record and returns the original `rows` unchanged up to that point, then
`_IWW_INLINE_RE = re.compile(r"(?i)IN\s+WITNESS\s+WHEREOF")` at line `164` finds the
inline IWW phrase inside `r["body_direct"]` (match `m`).
4. For this body, `iww_and_after = body[m.start():]` and `end_period =
iww_and_after.find(".", m.end() - m.start())` at line `3488` returns the position of the
period in the abbreviation `"Inc."` (the first `.` after the phrase), so `iww_sentence =
iww_and_after[:end_period + 1]` at line `3494` becomes `"IN WITNESS WHEREOF, ULURU Inc."`
and the remainder `" has caused this Agreement to be executed …"` is assigned to
`sig_rest` and fed into `_split_sig_block_body_into_lines(sig_rest)` at line `240`. As a
result, the main IWW operating clause text after the corporate abbreviation is
misclassified as signature-line content rather than part of the IWW L1 record when
`parse_one` processes such a document.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** scripts/parse_doc2dict_with_config.py
**Line:** 3488:3495
**Comment:**
*Logic Error: The IWW split uses the first `.` after `IN WITNESS WHEREOF` as the sentence terminator, which is unsafe for legal text containing abbreviations (for example `Inc.` or month abbreviations) before the actual clause end. This truncates the IWW clause and incorrectly pushes the remainder into signature-line parsing. Use a more robust boundary (e.g., end-of-line clause boundary or a stricter terminator pattern) so abbreviations do not prematurely end the clause.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| seen_lines: set[str] = set() | ||
| for line in lines: | ||
| if line in seen_lines: | ||
| continue # collapse duplicate | ||
| seen_lines.add(line) |
There was a problem hiding this comment.
Suggestion: Global deduplication of signature lines removes repeated lines anywhere in the block, but repeated values are valid across multiple signers (for example identical By:/Title: lines). This causes data loss by dropping legitimate signer rows. Restrict deduplication to clearly duplicated adjacent lines from parser duplication artifacts, or keep all lines and let downstream grouping handle repeats. [logic error]
Severity Level: Major ⚠️
- ⚠️ Repeated signature titles dropped for later signers.
- ⚠️ Multi-signer blocks lose structurally valid duplicated lines.
- ⚠️ Downstream sig analysis sees incomplete signer attributes.Steps of Reproduction ✅
1. Use the same `_split_iww_and_sig_from_body` pipeline from
`scripts/parse_doc2dict_with_config.py:115-266` (called from `parse_one` at line `4127`),
but construct a `rows` record where the inline sig-page tail `sig_rest` includes two
different signers whose lines share identical text, e.g. `"By:\nTitle:
Director\nBy:\nTitle: Director\n"` so that both signers have the same `Title:` line.
2. When `_split_iww_and_sig_from_body` reaches lines `240-246`, it calls
`_split_sig_block_body_into_lines(sig_rest)` at line `240`, which returns `lines` as
`["By:", "Title: Director", "By:", "Title: Director"]` after trimming whitespace and
dropping footer chrome.
3. The deduplication loop at lines `3540-3544` initializes `seen_lines: set[str] = set()`
and then iterates `for line in lines:`; the first `"Title: Director"` is added to
`seen_lines`, but when the second signer's `"Title: Director"` is encountered, the `if
line in seen_lines: continue` branch is taken, so that duplicate line is skipped and never
emitted as a separate sig-line record.
4. Downstream, `_explode_signature_block_lines` at
`scripts/parse_doc2dict_with_config.py:269-555` and
`_merge_body_only_sig_fragments_in_order` at `558-615` operate only on the deduplicated
set; the second signer's title line has been lost permanently, so any consumers of
`parse_one`'s `sections` output (e.g. the Parquet written in `main` at `872-895`) see
incomplete signature information when there are structurally valid repeated lines across
signers.Fix in Cursor | Fix in VSCode Claude
(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** scripts/parse_doc2dict_with_config.py
**Line:** 3540:3544
**Comment:**
*Logic Error: Global deduplication of signature lines removes repeated lines anywhere in the block, but repeated values are valid across multiple signers (for example identical `By:`/`Title:` lines). This causes data loss by dropping legitimate signer rows. Restrict deduplication to clearly duplicated adjacent lines from parser duplication artifacts, or keep all lines and let downstream grouping handle repeats.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix|
CodeAnt AI finished reviewing your PR. |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/parse_doc2dict_with_config.py`:
- Around line 4120-4127: Add a unit/freeze fixture that exercises the
IWW-in-body rescue by constructing a parser input where a body node contains the
"IN WITNESS WHEREOF" phrase mid-body followed by signature lines so
_split_iww_and_sig_from_body actually performs a split; update or add a test
(targeting the parser run that produces the sections list and the post-split
structure) to assert that an IWW L1 node is emitted and that each signature line
becomes L2 children in the expected order and parentage, ensuring the case is
not a no-op (idx=6) and fails until the split code path runs.
- Around line 3452-3465: There are three separate IN WITNESS WHEREOF matchers
(_IWW_INLINE_RE in this function, the _is_iww_clause() helper, and logic in
_find_signature_cutoff); consolidate them by defining a single module-level
compiled regex (e.g., IWW_RE) and update all call sites to use it: replace the
local _IWW_INLINE_RE with the module-level name in
parse_doc2dict_with_config.py, refactor _is_iww_clause() to consult IWW_RE, and
change the fallback/cutoff logic inside _find_signature_cutoff to use IWW_RE as
well so the phrase matching is compiled once and consistent across
_is_iww_clause, _find_signature_cutoff, and the loop that checks rows.
- Around line 3501-3525: The synthetic IWW node is being attached to
r.get("parent_node_id") (the contaminated subsection) which yields an
inconsistent tree; instead set iww_record["parent_node_id"] to the nearest
ancestor at the current subdoc container level (i.e., the node whose depth ==
l1_depth - 1). To fix: before creating iww_record, compute container_root_id by
walking the parent chain (using existing row parent_node_id values in r and/or
new_rows) until you find a node with depth == l1_depth - 1, then assign
parent_node_id = container_root_id rather than r.get("parent_node_id"); keep
other fields (subdoc_penalty, l1_depth, iww_node_id, next_id) unchanged.
- Around line 3539-3544: The current loop over lines uses seen_lines to globally
dedupe duplicates which removes non-consecutive repeats; change it to only
collapse adjacent duplicate signature lines by replacing the seen_lines set
logic with a previous-line check: iterate through lines produced by
_split_sig_block_body_into_lines(sig_rest), keep a variable like prev_line
(initially None), skip the current line only if line == prev_line (to collapse
consecutive duplicates), otherwise emit/add the line and set prev_line = line;
update the code paths around variables lines, line and seen_lines accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 724a85b1-5c7d-4891-a2f1-d73263ddf14c
📒 Files selected for processing (3)
data/auto_parse/level_freeze/frozen/idx_6.jsonldata/auto_parse/level_freeze/state.jsonscripts/parse_doc2dict_with_config.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (Custom checks)
**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run<cli> --help, assert exit code 0. Fail if smoke test fails.
Runuv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -qfor Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Runuv run ruff check . --difffor Python linting. Fail if exit code is non-zero and list each violation.
Runuv run ruff format --check --diff .for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Runuv run ruff check --select I,F401 .to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite:uv run pytest --tb=line -qon origin/main to capture baseline pass/fail counts, anduv run pytest --tb=short -qon PR branch. Fail immediately if exit code is non-zero.
Runuv run typy checkfor Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new baretype: ignorecomments (without error codes) in Python files andcast()calls without explanatory comments. Warn for each. Fail if baretype: ignorecount > 3.
Files:
scripts/parse_doc2dict_with_config.py
**/*.{py,ts,tsx}
📄 CodeRabbit inference engine (Custom checks)
For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.
Files:
scripts/parse_doc2dict_with_config.py
🔍 Remote MCP
Summary of Relevant Context Found
Based on my research across available tools, here's the additional context relevant to this PR review:
Document Source & Context
The ARRIS Employment Agreement is between ARRIS Group, Inc. and Timothy O'Loughlin (President, North American Sales). O'Loughlin spent 18 years as President of the Americas Business Unit and ran Americas Sales and global marketing for ARRIS prior to its sale to CommScope, establishing the executive-level nature of this employment agreement.
"IN WITNESS WHEREOF" Clause Context
The PR's defensive handling of IWW clauses is contextually important. "IN WITNESS WHEREOF" serves as the formal conclusion to a contract or legal document, indicating that the parties have agreed to the terms, and typically precedes the signature blocks where each party signs and dates the document. "In witness whereof" is an old legal phrase signaling the signing section of a contract—not legally required, but still widely used, and appears in formal closing phrases in contracts, deeds, and other legal documents that signal the end of the agreement's terms and introduces the signature block. This underscores why the _split_iww_and_sig_from_body function acts as a corpus-wide safety net when documents have embedded IWW content within body text rather than as a standalone structural element.
Section Detection & Regex Patterns
The PR's refinement to filter out textual cross-references (e.g., "this Section 5") aligns with broader document parsing challenges. Document parsers handle numerous formatting variations by trying multiple regex patterns and falling back gracefully. Parsers identify unique identifiers or paragraph markers, so patterns like "Part I: Blah Blah" become coded section references, and lettered subsections follow hierarchical patterns. The 15-character lookbehind check in _split_inline_section_markers prevents false positives from inline references—a pragmatic solution to a common parsing problem.
Verification Status
The PR demonstrates comprehensive regression testing: the output shows 69 records with documented hierarchical structure (L0:1, L1:14, L2:45, L3:9), word coverage at 99.4%, character ratio at 99.7%, and all prior frozen indices (0–5) remain byte-identical after parser changes. This validates both the new preprocessing passes and the section-reference filter changes.
Note on User Requirements: The user requirements mention "Stitch" tools for design recommendations; however, these tools are not available in the current toolset. This PR is a code/parser review, not a design task, so those requirements do not apply.
🔇 Additional comments (4)
data/auto_parse/level_freeze/frozen/idx_6.jsonl (1)
1-69: LGTM!data/auto_parse/level_freeze/state.json (2)
9-10: LGTM!
143-166: LGTM!scripts/parse_doc2dict_with_config.py (1)
2419-2436: LGTM!
| # Guard: only act when no existing record carries IWW as leading text. | ||
| # Without this, we'd double-split documents that already have a clean | ||
| # IWW record (the prior idxs) and break their frozen baselines. | ||
| for r in rows: | ||
| if r.get("is_envelope") or r.get("scope") == "trailer": | ||
| continue | ||
| title = (r.get("title") or "").strip() | ||
| body = (r.get("body_direct") or "").strip() | ||
| if _is_iww_clause(title) or _is_iww_clause(body): | ||
| return rows | ||
|
|
||
| _IWW_INLINE_RE = re.compile(r"(?i)IN\s+WITNESS\s+WHEREOF") | ||
|
|
||
| next_id = max((r["node_id"] for r in rows), default=-1) + 1 |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Centralize the IWW matcher instead of adding another local regex.
This path introduces a third definition of “IN WITNESS WHEREOF” matching (_IWW_INLINE_RE here, _is_iww_clause(), and the separate fallback logic in _find_signature_cutoff). That leaves the rescue, cutoff detection, and L1 pinning free to drift again. Please promote this phrase to one module-level compiled regex and route all call sites through it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/parse_doc2dict_with_config.py` around lines 3452 - 3465, There are
three separate IN WITNESS WHEREOF matchers (_IWW_INLINE_RE in this function, the
_is_iww_clause() helper, and logic in _find_signature_cutoff); consolidate them
by defining a single module-level compiled regex (e.g., IWW_RE) and update all
call sites to use it: replace the local _IWW_INLINE_RE with the module-level
name in parse_doc2dict_with_config.py, refactor _is_iww_clause() to consult
IWW_RE, and change the fallback/cutoff logic inside _find_signature_cutoff to
use IWW_RE as well so the phrase matching is compiled once and consistent across
_is_iww_clause, _find_signature_cutoff, and the loop that checks rows.
| subdoc_penalty = r.get("subdoc_penalty") or 0 | ||
| l1_depth = 1 + subdoc_penalty | ||
| l2_depth = 2 + subdoc_penalty | ||
|
|
||
| # Emit the IWW operating clause as L1. | ||
| iww_record = { | ||
| "node_id": next_id, | ||
| "parent_node_id": r.get("parent_node_id"), | ||
| "depth": l1_depth, | ||
| "doc2dict_section_key": f"_iww_{next_id}", | ||
| "cls": "promoted text leaf", | ||
| "title": "", | ||
| "standardized_title": None, | ||
| "n_direct_section_children": 0, | ||
| "body_direct_chars": len(iww_sentence), | ||
| "body_recursive_chars": len(iww_sentence), | ||
| "body_direct": iww_sentence, | ||
| "body_recursive": iww_sentence, | ||
| "subdoc_penalty": subdoc_penalty, | ||
| "is_envelope": False, | ||
| "scope": "agreement", | ||
| } | ||
| new_rows.append(iww_record) | ||
| next_id += 1 | ||
| iww_node_id = iww_record["node_id"] |
There was a problem hiding this comment.
Reparent the synthetic IWW node to the agreement/subdoc root.
parent_node_id is copied from the contaminated subsection, so when this rescue fires the new depth-1 IWW row becomes a child of a deeper clause. That leaves the tree internally inconsistent and gives every emitted sig-line the wrong lineage in parquet/downstream tree-based passes. The parent here should be the nearest ancestor at the current subdoc_penalty container level, not r.get("parent_node_id").
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/parse_doc2dict_with_config.py` around lines 3501 - 3525, The
synthetic IWW node is being attached to r.get("parent_node_id") (the
contaminated subsection) which yields an inconsistent tree; instead set
iww_record["parent_node_id"] to the nearest ancestor at the current subdoc
container level (i.e., the node whose depth == l1_depth - 1). To fix: before
creating iww_record, compute container_root_id by walking the parent chain
(using existing row parent_node_id values in r and/or new_rows) until you find a
node with depth == l1_depth - 1, then assign parent_node_id = container_root_id
rather than r.get("parent_node_id"); keep other fields (subdoc_penalty,
l1_depth, iww_node_id, next_id) unchanged.
| lines = _split_sig_block_body_into_lines(sig_rest) | ||
| seen_lines: set[str] = set() | ||
| for line in lines: | ||
| if line in seen_lines: | ||
| continue # collapse duplicate | ||
| seen_lines.add(line) |
There was a problem hiding this comment.
Only collapse adjacent duplicate signature lines.
The docstring says “duplicate consecutive lines are collapsed”, but the seen_lines set removes any repeated line anywhere in the block. That will erase legitimate later lines when two signers share the same company label or field text. Collapse only back-to-back duplicates here.
Suggested fix
- seen_lines: set[str] = set()
+ prev_line: str | None = None
for line in lines:
- if line in seen_lines:
+ if line == prev_line:
continue # collapse duplicate
- seen_lines.add(line)
+ prev_line = line
new_rows.append({📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| lines = _split_sig_block_body_into_lines(sig_rest) | |
| seen_lines: set[str] = set() | |
| for line in lines: | |
| if line in seen_lines: | |
| continue # collapse duplicate | |
| seen_lines.add(line) | |
| lines = _split_sig_block_body_into_lines(sig_rest) | |
| prev_line: str | None = None | |
| for line in lines: | |
| if line == prev_line: | |
| continue # collapse duplicate | |
| prev_line = line |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/parse_doc2dict_with_config.py` around lines 3539 - 3544, The current
loop over lines uses seen_lines to globally dedupe duplicates which removes
non-consecutive repeats; change it to only collapse adjacent duplicate signature
lines by replacing the seen_lines set logic with a previous-line check: iterate
through lines produced by _split_sig_block_body_into_lines(sig_rest), keep a
variable like prev_line (initially None), skip the current line only if line ==
prev_line (to collapse consecutive duplicates), otherwise emit/add the line and
set prev_line = line; update the code paths around variables lines, line and
seen_lines accordingly.
| # Split inline "IN WITNESS WHEREOF" + signature page out of a body. | ||
| # doc2dict sometimes packs the entire sig page into the trailing body | ||
| # of the LAST lettered subsection when the HTML walker couldn't | ||
| # promote it. This pass detects that pathology STRUCTURALLY (no | ||
| # existing IWW record in the document, IWW phrase appearing mid-body) | ||
| # and emits IWW as L1 + each sig-page line as L2 — restoring the | ||
| # title-as-root sig-page hierarchy. | ||
| sections = _split_iww_and_sig_from_body(sections) |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Add a fixture that actually exercises the IWW-in-body rescue.
Per the PR summary, idx=6 is a no-op here, so this new safety net can merge without a regression that proves the split path, parentage, and sig-line ordering. Please add or point to a targeted parser/freeze case where a body node contains mid-body IWW plus signature lines.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/parse_doc2dict_with_config.py` around lines 4120 - 4127, Add a
unit/freeze fixture that exercises the IWW-in-body rescue by constructing a
parser input where a body node contains the "IN WITNESS WHEREOF" phrase mid-body
followed by signature lines so _split_iww_and_sig_from_body actually performs a
split; update or add a test (targeting the parser run that produces the sections
list and the post-split structure) to assert that an IWW L1 node is emitted and
that each signature line becomes L2 children in the expected order and
parentage, ensuring the case is not a no-op (idx=6) and fails until the split
code path runs.
|
Triage agent — PR #79 comment review (read-only pass, no code changes) 8 inline comments reviewed:
WILL-DEFER items (8 distinct): global sig-line dedup → adjacent-only; 2x module-level regex hoisting; section-ref filter false positives; IWW Triage only — no code changes made this round. |
User description
Summary
Seventh stacked PR. Adds idx=6 (EMPLOYMENT AGREEMENT between ARRIS Group, Inc. and Timothy O'Loughlin, January 4, 2016) as the seventh verified frozen baseline on top of idx=5 (PR #78).
Parser changes (2 surgical, shape-driven)
_split_iww_and_sig_from_body(new, ~lines 3414-3565) — defensive rescue for when doc2dict packs the entire sig page (IWW + party labels + /s/ + By/Name/Title) into the trailing body of the last numbered subsection. Guards: returns rows unchanged if any record already starts withIN WITNESS WHEREOFas a standalone span. When the guard passes and a body contains IWW mid-text, splits into prefix + IWW (L1) + per-line sig records (L2 marked_sig_line=Trueso the existing_consolidate_sig_lines_after_iwwanchors them).For idx=6 this is a no-op — doc2dict naturally gave 12 separate leaf nodes (1 IWW + 11 sig fragments). The function is a corpus-wide safety net for the OTHER pattern where doc2dict packs everything into one body. All 6 prior idxs verified untouched by the function.
Section-reference filter in
_split_inline_section_markers(line ~2385): cross-references like"this Section 5."or"Sections 7, 8 and 9 hereof"were being picked up by the marker scanner as candidate section starts, breaking the gap-1 monotonic check. Added a 15-char lookbehind dropping any candidate preceded by\bSections?\s*[\xa0\s]*$. For idx=6 this reduced[5, 11, 12]→[11, 12](clean monotonic), allowing Sections 11 and 12 to emit as their own L1 records.Verified output for idx=6
{L0:1, L1:14, L2:45, L3:9}Top-level structure
Sig area (doc2dict's natural per-line fragmentation, preserved per rubric)
The 11-record sig split is doc2dict's natural HTML grouping (each line was a separate
<p>element in the source). Per the rubric's "preserve doc2dict natural grouping at depth 2" rule, the parser preserves whatever doc2dict provides — does not impose per-line splitting and does not merge across parties. Inspector verified each of these 11 records corresponds to a separatecls='promoted text leaf'node in the raw parquet (node_ids 60-70).Test plan
uv run scripts/parse_doc2dict_with_config.py --limit 7 --no-truncate --output-dir data/auto_parseexits 0 withok 7uv run scripts/level_loop/freeze.py 6 --forcereports word_coverage ≥ 90% (99.4%)uv run scripts/level_loop/regress.pyreports all 7 frozen idxs OKSource
http://www.sec.gov/Archives/edgar/data/1645494/000119312517164198/d342430dex101.htm
Why this matters for the corpus
_split_iww_and_sig_from_bodyadds a defensive layer for the cross-corpus pattern where doc2dict packs the sig page into the last subsection's body instead of emitting it as separate nodes. Guarded by IWW-already-present check so it cannot regress agreements where the sig page is already structurally separate.🤖 Generated with Claude Code
CodeAnt-AI Description
Parse agreements with embedded section headings and signature pages correctly
What Changed
Impact
✅ Fewer missing section splits✅ Correct signature-page ordering✅ More complete agreement extractions🔄 Retrigger CodeAnt AI Review
💡 Usage Guide
Checking Your Pull Request
Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.
Talking to CodeAnt AI
Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:
This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.
Example
Preserve Org Learnings with CodeAnt
You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:
This helps CodeAnt AI learn and adapt to your team's coding style and standards.
Example
Retrigger review
Ask CodeAnt AI to review the PR again, by typing:
Check Your Repository Health
To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.