idx=12: freeze (412 records) - Triton Container Ninth Restated Agreement (widened IWW + ancestor up-walk) #85
… Agreement: IWW carrier detected when title is page-chrome filler, sig parties demoted to L2 via ancestor up-walk

idx=12 is the Ninth Restated and Amended Credit Agreement among Triton Container International Limited (Bermuda), various lenders, and Bank of America (administrative agent), filed as SEC EX-10. Source URL: www.sec.gov/Archives/edgar/data/1660734/000166073417000038/tritoncontainerninthrestat.htm

Two shape-based fixes in _explode_signature_block_lines, both keyed on the same structural cue: doc2dict packed the page-chrome filler "[Remainder of page intentionally left blank]" into the title of the IWW carrier, so the IWW operating sentence ended up in the body, not at the start of the joined span.

1. PASS-2 IWW carrier detection now checks both _is_iww_clause(span) AND _is_iww_clause(body). The body-alone check fires when the title is page-chrome filler ahead of the IWW sentence. The same widened check is applied at every IWW-skip point inside the IWW-anchored branch (sibling loop, descendant walk, PASS 3 demotion) so the IWW carrier is never inadvertently demoted to L2.

2. sig_area_parent_ids now expands via an ancestor up-walk (bounded to 4 hops) when the IWW carrier's immediate siblings carry no sig-shape records. For idx=12, the IWW lives as a child of an all-caps continuation node (nid=247, the "CONNECTION WITH THIS AGREEMENT…" body fragment of 15.16 Waiver of Jury Trial that doc2dict split off at a page break). The actual sig parties (TRITON CONTAINER, BoA, MUFG, SunTrust, Wells Fargo) are siblings of nid=247 under SECTION 15. GENERAL (nid=246). The up-walk stops at the L0 title, at section-marker ancestors, or as soon as sig-shape siblings are found at the current level; it is bounded so that it cannot reach top-level body clauses.

idx=2 (the other [Remainder of page]/IWW agreement in the corpus) keeps its frozen baseline because _consolidate_sig_lines_after_iww still uses the strict _is_iww_clause(span) check, so the existing PANDORA/KKR /s/-carrier placement is unchanged.

Stats: 412 records, levels {0: 1, 1: 22, 2: 336, 3: 53}, word coverage 97.1%, char ratio 97.3%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
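A minimal sketch of the widened span-or-body check described in the commit message, for readers who want to see why the body-alone test matters. The `^\s*IN\s+WITNESS` anchor is quoted from the PR description; the function names, the `title` field name, the way title and body are joined into a span, and the case-insensitive flag are all assumptions made for illustration, not the parser's actual code.

```python
import re

# Anchor quoted in the PR description: IWW must open the text it is tested against.
# The IGNORECASE flag is an assumption, not something the PR states.
_IWW_RE = re.compile(r"^\s*IN\s+WITNESS", re.IGNORECASE)


def is_iww_clause(text: str) -> bool:
    """Hypothetical stand-in for the parser's _is_iww_clause."""
    return bool(_IWW_RE.match(text or ""))


def is_iww_carrier(record: dict) -> bool:
    """Span-or-body check: fire on the joined span OR on the body alone."""
    title = (record.get("title") or "").strip()       # field name assumed
    body = (record.get("body_direct") or "").strip()  # field name taken from the PR
    span = f"{title} {body}".strip()                  # joining rule assumed
    return is_iww_clause(span) or is_iww_clause(body)


# Title holds page-chrome filler, so only the body-alone check can match.
filler = {
    "title": "[Remainder of page intentionally left blank]",
    "body_direct": "IN WITNESS WHEREOF, the parties hereto have executed this Agreement.",
}
# IWW wording in the middle of a sentence must not match under the strict anchor.
mid_text = {"title": "", "body_direct": "The parties sign in witness whereof below."}

print(is_iww_carrier(filler))    # True
print(is_iww_carrier(mid_text))  # False
```

Because both tests reuse the same start-of-text anchor, widening the check adds one more place to look without loosening what counts as an IWW sentence.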
📝 Walkthrough (summary by CodeRabbit)
This PR enhances the signature-page explosion logic to robustly detect and preserve the IWW (IN WITNESS WHEREOF) operating clause, then records the completion of a freeze operation for document index 12. The core behavioral change improves how the algorithm identifies IWW text and computes signature-area parent candidates across multiple scan phases.
Changes
Signature-Page IWW Handling and Freeze
Sequence Diagram
sequenceDiagram
participant PASS2_Detect as PASS-2: IWW Detection
participant PASS2_Parents as PASS-2: Parent Walk
participant PASS2_Sibling as PASS-2: Sibling Skip
participant PASS2_Desc as PASS-2: Descendant Skip
participant PASS3_Pin as PASS-3: Depth Pin
PASS2_Detect->>PASS2_Parents: iww_present, iww_carriers (span or body)
PASS2_Parents->>PASS2_Sibling: sig_area_parent_ids (bounded walk)
PASS2_Sibling->>PASS2_Desc: skip IWW (span or body)
PASS2_Desc->>PASS3_Pin: skip IWW (span or body)
PASS3_Pin->>PASS3_Pin: keep IWW at L1 (span or body)
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 passed | ❌ 1 failed (1 warning)
Code Review
This pull request enhances the IWW (In Witness Whereof) detection logic by checking both the combined span and the direct body text, which prevents filler text in titles from obscuring the IWW anchor. It also introduces an up-walk mechanism to locate signature areas by searching up to four ancestor levels for signature-shaped siblings. Review feedback identifies a regression where root-level IWW carriers are skipped in the new parent ID initialization and suggests a fix. Additionally, the reviewer recommends refactoring duplicated signature detection logic into a helper function to improve maintainability.
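The bounded up-walk and the root-level seeding concern raised in this review can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code in `scripts/parse_doc2dict_with_config.py`: `find_sig_area_parents`, `has_sig_child`, the two predicate callbacks, and the toy records (including their `depth` values and the `section`/`sig`/`iww` flags) are hypothetical. Only the 4-hop cap, the three early-exit guards, the `parent_node_id` field, and the node ids 246 to 248 come from the PR text.

```python
from typing import Callable, Optional

MAX_HOPS = 4  # hard cap quoted in the PR description


def find_sig_area_parents(
    iww_carriers: list[dict],
    records: list[dict],
    is_sig_shape: Callable[[dict], bool],
    has_section_marker_title: Callable[[dict], bool],
) -> set[Optional[int]]:
    by_node_id = {r["node_id"]: r for r in records}

    def has_sig_child(pid: Optional[int], skip_id: int) -> bool:
        # True when some child of `pid` (other than `skip_id`) looks like a signature record.
        return any(
            r.get("parent_node_id") == pid
            and r["node_id"] != skip_id
            and is_sig_shape(r)
            for r in records
        )

    # Seed with every carrier's immediate parent; None keeps root-level carriers covered.
    parents: set[Optional[int]] = {c.get("parent_node_id") for c in iww_carriers}

    for carrier in iww_carriers:
        cur_pid = carrier.get("parent_node_id")
        if cur_pid is None or has_sig_child(cur_pid, carrier["node_id"]):
            continue  # root-level, or sig records already sit next to the carrier
        seen: set[int] = set()
        hops = 0
        while cur_pid is not None and cur_pid not in seen and hops < MAX_HOPS:
            seen.add(cur_pid)
            ancestor = by_node_id.get(cur_pid)
            if ancestor is None or ancestor.get("depth") == 0:
                break  # guard (a): reached the L0 title
            if has_section_marker_title(ancestor):
                break  # guard (b): a real agreement clause
            upper_pid = ancestor.get("parent_node_id")
            hops += 1
            if has_sig_child(upper_pid, cur_pid):
                parents.add(upper_pid)  # guard (c): sig-shape siblings found
                break
            cur_pid = upper_pid
    return parents


# Toy tree mirroring the idx=12 shape: IWW (248) under a continuation node (247),
# with a signature party (249) as a sibling of 247 under the section node (246).
records = [
    {"node_id": 1, "parent_node_id": None, "depth": 0},
    {"node_id": 246, "parent_node_id": 1, "depth": 1, "section": True},
    {"node_id": 247, "parent_node_id": 246, "depth": 2},
    {"node_id": 248, "parent_node_id": 247, "depth": 3, "iww": True},
    {"node_id": 249, "parent_node_id": 246, "depth": 2, "sig": True},
]
carriers = [r for r in records if r.get("iww")]
result = find_sig_area_parents(
    carriers,
    records,
    is_sig_shape=lambda r: bool(r.get("sig")),
    has_section_marker_title=lambda r: bool(r.get("section")),
)
print(sorted(result))  # [246, 247]
```

Seeding the set with the immediate parents before walking is what keeps a carrier with `parent_node_id=None` in play; the walk itself only ever adds ancestors on top of that seed.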
sig_area_parent_ids: set[int | None] = set()
for iww in iww_carriers:
    cur_pid = iww.get("parent_node_id")
    iww_seen: set[int | None] = set()
    walked = 0
    while cur_pid is not None and cur_pid not in iww_seen and walked < 4:
The initialization of sig_area_parent_ids as an empty set, combined with the while cur_pid is not None condition, introduces a regression where root-level IWW carriers (those with parent_node_id=None) no longer have their siblings checked for signature shapes. The original implementation correctly included None in the set of parent IDs to consider.
To fix this, initialize sig_area_parent_ids with the immediate parents of all IWW carriers, which restores the original behavior for root nodes, and then perform the up-walk for non-root parents.
- sig_area_parent_ids: set[int | None] = set()
- for iww in iww_carriers:
-     cur_pid = iww.get("parent_node_id")
-     iww_seen: set[int | None] = set()
-     walked = 0
-     while cur_pid is not None and cur_pid not in iww_seen and walked < 4:
+ sig_area_parent_ids: set[int | None] = {
+     iww.get("parent_node_id") for iww in iww_carriers
+ }
+ for iww in iww_carriers:
+     cur_pid = iww.get("parent_node_id")
+     if cur_pid is None:
+         continue
+     iww_seen: set[int | None] = set()
+     walked = 0
+     while cur_pid is not None and cur_pid not in iww_seen and walked < 4:
if (
    _SIG_FIELD_RE.match(s_title)
    or _SIG_FIELD_RE.match(s_body)
    or (s_title and _SIG_BLOCK_LABEL_RE.match(s_title))
    or (s_title and _CORP_SUFFIX_LABEL_RE.match(s_title))
    or (not s_title and s_body and _SIG_FIELD_RE.match(s_body))
):
The logic for detecting signature-shaped records is now duplicated in multiple places within this function (the up-walk, the sibling loop, and the descendant loop). This increases the risk of inconsistencies if the signature detection rules need to be updated. Consider extracting this logic into a local helper function to improve maintainability.
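One possible shape for the suggested helper is sketched below. The three regex patterns are placeholders invented for this example; the real `_SIG_FIELD_RE`, `_SIG_BLOCK_LABEL_RE` and `_CORP_SUFFIX_LABEL_RE` are defined in the parser and are not shown in this PR. The helper name `is_sig_shape` is likewise hypothetical. Note that the last clause of the quoted condition is already covered by the plain body check on its second line, so the sketch omits it.

```python
import re

# Placeholder patterns (assumptions); the parser's real regexes are not shown in the PR.
SIG_FIELD_RE = re.compile(r"^\s*(By|Name|Title|Its)\s*:", re.IGNORECASE)
SIG_BLOCK_LABEL_RE = re.compile(r"^\s*(BORROWER|LENDERS?|ADMINISTRATIVE AGENT)\b")
CORP_SUFFIX_LABEL_RE = re.compile(r"^[A-Z0-9 ,.&'-]+\b(LIMITED|LLC)\s*$")


def is_sig_shape(title: str, body: str) -> bool:
    """Single place for the signature-shape test used by each scan loop."""
    title = (title or "").strip()
    body = (body or "").strip()
    return bool(
        SIG_FIELD_RE.match(title)
        or SIG_FIELD_RE.match(body)
        or SIG_BLOCK_LABEL_RE.match(title)
        or CORP_SUFFIX_LABEL_RE.match(title)
    )


print(is_sig_shape("", "By: /s/ Jane Doe"))                          # True
print(is_sig_shape("TRITON CONTAINER INTERNATIONAL LIMITED", ""))   # True
print(is_sig_shape("15.16 Waiver of Jury Trial", "Each party ..."))  # False
```

With a helper like this, the up-walk, the sibling loop, and the descendant loop would each call one function, so a rule change lands in all three at once.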
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/parse_doc2dict_with_config.py`:
- Around line 4265-4270: The up-walk loop starting from cur_pid =
iww.get("parent_node_id") skips adding a search base when parent_node_id is
None, so root-level IWW carriers never seed sibling discovery; fix by detecting
if iww.get("parent_node_id") is None before the while and explicitly add the
IWW's node id (e.g., iww.get("node_id")) or an appropriate root search key into
the up-walk seed collection so root-level sig-shape siblings are included, then
proceed with the existing while using cur_pid, iww_seen, walked and by_node_id
as before.
- Around line 4210-4212: In the /s/ PASS-3 branch currently using a span-only
IWW check, update the conditional to use the same span-or-body guard as earlier:
replace uses of _is_iww_clause(span) with (_is_iww_clause(span) or
_is_iww_clause(body)) (reusing the existing body = (r.get("body_direct") or
"").strip() value) so that the branch that sets r["depth"] for PASS-3 respects
IWW found in body_direct as well; keep the rest of the branch logic (r["depth"]
assignments and subdoc_penalty) unchanged.
📒 Files selected for processing (3)
data/auto_parse/level_freeze/frozen/idx_12.jsonl
data/auto_parse/level_freeze/state.json
scripts/parse_doc2dict_with_config.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (Custom checks)
**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run `<cli> --help`, assert exit code 0. Fail if smoke test fails.
Run `uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q` for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run `uv run ruff check . --diff` for Python linting. Fail if exit code is non-zero and list each violation.
Run `uv run ruff format --check --diff .` for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run `uv run ruff check --select I,F401 .` to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: `uv run pytest --tb=line -q` on origin/main to capture baseline pass/fail counts, and `uv run pytest --tb=short -q` on PR branch. Fail immediately if exit code is non-zero.
Run `uv run typy check` for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare `type: ignore` comments (without error codes) in Python files and `cast()` calls without explanatory comments. Warn for each. Fail if bare `type: ignore` count > 3.
Files:
scripts/parse_doc2dict_with_config.py
**/*.{py,ts,tsx}
📄 CodeRabbit inference engine (Custom checks)
For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.
Files:
scripts/parse_doc2dict_with_config.py
🔍 Remote MCP
Based on my search efforts, I was unable to access the repository-specific documentation (DeepWiki does not have the arthrod/clause-extract repository indexed), and Context7 encountered rate limits. However, I was able to gather valuable context about IWW clauses from web search results.
Additional Context Gathered
Understanding IWW (IN WITNESS WHEREOF) Clauses
The "IN WITNESS WHEREOF" clause serves as the formal conclusion to a contract or legal document, indicating that the parties have agreed to the terms and are executing the agreement. Typically, this clause precedes the signature blocks, where each party signs and dates the document, thereby making it legally binding.
This context is directly relevant to PR #85's parser improvements, which focus on correctly detecting and isolating IWW clauses in document parsing. The PR's widened IWW detection (checking both span and body) addresses cases where page formatting artifacts may separate the IWW phrase from associated signature block content.
Key Challenges in IWW Detection
The web search results highlight why parser logic for IWW detection is challenging:
- Sometimes contracts include lines like "In witness whereof, the undersigned have executed this agreement," even though there's no actual witness signature on the document, which can raise questions about authenticity or suggest the document is incomplete.
- A growing number of lawyers and organizations have moved away from "In Witness Whereof" entirely, replacing it with straightforward English or dropping the testimonium clause altogether, with common alternatives including "Signed by the parties on [date]," "Agreed to and signed," or simply placing a date and signature block after the last contractual provision with no introductory phrase at all.
Signature Block Structure
A signature block is the section of a legal document where the parties sign their names to indicate their consent and commitment to the agreement, and it typically includes names, titles, and date lines. The parser's ancestor up-walk logic appears designed to correctly identify these signature regions within hierarchical document structures.
Assessment
The PR's parser improvements are well-targeted for real-world document structure challenges: documents where page layout places IWW and signature blocks in non-obvious hierarchical relationships, and where OCR or page-chrome artifacts (like "[Remainder of page intentionally left blank]") create parsing ambiguities.
🔇 Additional comments (2)
scripts/parse_doc2dict_with_config.py (1)
4321-4325: LGTM! Also applies to: 4345-4347, 4370-4371
data/auto_parse/level_freeze/state.json (1)
15-16: LGTM! Also applies to: 203-208
body = (r.get("body_direct") or "").strip()
if _is_iww_clause(span) or _is_iww_clause(body):
    r["depth"] = 1 + (r.get("subdoc_penalty") or 0)
Apply span-or-body IWW guard in the /s/ PASS-3 branch too.
This change widens IWW detection here, but Line 4535 still uses span-only exclusion. In documents where IWW is in body_direct and title has filler text, the /s/ branch can still demote the IWW carrier to L2.
Suggested fix
- if _is_iww_clause(_span_text(r)):
+ r_body = (r.get("body_direct") or "").strip()
+ if _is_iww_clause(_span_text(r)) or _is_iww_clause(r_body):
      continue
cur_pid = iww.get("parent_node_id")
iww_seen: set[int | None] = set()
walked = 0
while cur_pid is not None and cur_pid not in iww_seen and walked < 4:
    iww_seen.add(cur_pid)
    parent_rec = by_node_id.get(cur_pid)
Handle root-level IWW carriers in the up-walk seed.
If an IWW carrier has parent_node_id=None, the current loop never adds a search base, so root-level sig-shape siblings are never discovered in the IWW-only (no /s/) path.
Suggested fix
  sig_area_parent_ids: set[int | None] = set()
  for iww in iww_carriers:
      cur_pid = iww.get("parent_node_id")
+     if cur_pid is None:
+         # Root-level IWW: scan root siblings as signature-area candidates.
+         sig_area_parent_ids.add(None)
+         continue
      iww_seen: set[int | None] = set()
      walked = 0
      while cur_pid is not None and cur_pid not in iww_seen and walked < 4:
User description
Summary
Thirteenth stacked PR. Adds idx=12 (NINTH RESTATED AND AMENDED CREDIT AGREEMENT, Triton Container International Ltd. + Bank of America + MUFG + SunTrust + Wells Fargo, April 2016) as the thirteenth verified frozen baseline on top of idx=11 (PR #84).
Parser changes (2 surgical, shape-driven; both inside `_explode_signature_block_lines`)
1. PASS-2 IWW detection widened to check `_is_iww_clause(span)` OR `_is_iww_clause(body)`. doc2dict packed "[Remainder of page intentionally left blank]" into the IWW carrier's title (nid=248) while the IWW operating sentence ended up in the body. The original `_is_iww_clause` only matched IWW at the START of the joined span, so the carrier was missed. Widened check applied at every IWW-skip point. Strict `^\s*IN\s+WITNESS` anchor still applies, so it cannot match mid-text IWW phrases.
2. Bounded ancestor up-walk (≤4 hops) for `sig_area_parent_ids` when the IWW carrier's immediate siblings carry no sig-shape records. Walks up the parent chain, looking for siblings of an ancestor that DO carry sig shapes. Three early-exit guards: (a) `depth == 0` (L0 title), (b) `_has_section_marker_title` (real agreement clause), (c) sig-shape sibling appears at current level. Cycle protection via `iww_seen` set + hard cap of 4 hops. For idx=12: IWW carrier (nid=248) is under nid=247 "CONNECTION WITH THIS AGREEMENT…" (page-break-split tail of Section 15.16 Waiver of Jury Trial); the actual sig parties sit as siblings of nid=247 under SECTION 15. GENERAL (nid=246). The up-walk finds them in 1 hop.
Verified output for idx=12
`{L0:1, L1:22, L2:336, L3:53}` (max depth 3)
IWW + sig area (verbatim)
idx=2 preserved (the OTHER "[Remainder]+IWW" agreement)
idx=2 (PR #75) has `/s/` marks (orders 420, 421), so it flows through the unmodified `has_slash_s` tree-chain branch. The form/template up-walk branch is gated by `not has_slash_s and iww_present`, so it is never reached for idx=2. Inspector verified idx=2 byte-identical to its baseline (422 records, PANDORA/KKR sig blocks at L2 unchanged).
Test plan
- `uv run scripts/parse_doc2dict_with_config.py --limit 13 --no-truncate --output-dir data/auto_parse` exits 0 with `ok 13`
- `uv run scripts/level_loop/freeze.py 12 --force` reports word_coverage ≥ 90% (97.1%)
- `uv run scripts/level_loop/regress.py` reports all 13 frozen idxs OK
- `^\s*IN\s+WITNESS` anchor still applies to widened span-OR-body check (no mid-text false positives)
Source
http://www.sec.gov/Archives/edgar/data/1660734/000166073417000038/tritoncontainerninthrestat.htm
🤖 Generated with Claude Code
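As a quick arithmetic cross-check, the level counts reported for idx=12 should add up to the 412 records in the PR title. The sketch below builds synthetic records with that distribution and recounts them; `level_histogram` is a hypothetical helper, and using a `depth` key for the level is an assumption about the frozen record layout.

```python
from collections import Counter


def level_histogram(records: list[dict]) -> dict[int, int]:
    """Count records per level, keyed by each record's depth."""
    return dict(sorted(Counter(r["depth"] for r in records).items()))


reported = {0: 1, 1: 22, 2: 336, 3: 53}  # levels quoted in the PR

# Synthetic records that reproduce the reported distribution.
records = [{"depth": level} for level, n in reported.items() for _ in range(n)]

assert level_histogram(records) == reported
print(len(records))  # 412
```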
CodeAnt-AI Description
Fix signature block freezing for agreements where the witness clause is split across title and body text
What Changed
Impact
✅ Fewer missed signature blocks
✅ More correct signer grouping on page-split agreements
✅ More accurate frozen parses for restated credit agreements