idx=12: freeze (412 records) — Triton Container Ninth Restated Improve IWW detection in signature-page explosion logicedit Agreement (widened IWW + ancestor up-walk)#85

Open
arthrod wants to merge 1 commit into redo/idx-11 from redo/idx-12
Conversation

@arthrod
Owner

@arthrod arthrod commented May 17, 2026

User description

Summary

Thirteenth stacked PR. Adds idx=12 (NINTH RESTATED AND AMENDED CREDIT AGREEMENT, Triton Container International Ltd. + Bank of America + MUFG + SunTrust + Wells Fargo, April 2016) as the thirteenth verified frozen baseline on top of idx=11 (PR #84).

Parser changes (2 surgical, shape-driven; both inside _explode_signature_block_lines)

  1. PASS-2 IWW detection widened to check _is_iww_clause(span) OR _is_iww_clause(body). doc2dict packed "[Remainder of page intentionally left blank]" into the IWW carrier's title (nid=248) while the IWW operating sentence ended up in the body. The original _is_iww_clause only matched IWW at the START of the joined span, so the carrier was missed. Widened check applied at every IWW-skip point. Strict ^\s*IN\s+WITNESS anchor still applies — cannot match mid-text IWW phrases.

  2. Bounded ancestor up-walk (≤4 hops) for sig_area_parent_ids when the IWW carrier's immediate siblings carry no sig-shape records. Walks up the parent chain, looking for siblings of an ancestor that DO carry sig shapes. Three early-exit guards: (a) depth == 0 (L0 title), (b) _has_section_marker_title (real agreement clause), (c) sig-shape sibling appears at current level. Cycle protection via iww_seen set + hard cap of 4 hops. For idx=12: IWW carrier (nid=248) is under nid=247 "CONNECTION WITH THIS AGREEMENT…" (page-break-split tail of Section 15.16 Waiver of Jury Trial); the actual sig parties sit as siblings of nid=247 under SECTION 15. GENERAL (nid=246). The up-walk finds them in 1 hop.
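
A minimal sketch of the widened check in item 1, assuming the joined span is simply the title followed by the body. The regex is the anchor quoted above; `is_iww_clause`, `is_iww_carrier`, and the joining rule are illustrative stand-ins, not the parser's actual code.

```python
import re

# Anchor quoted in the PR description. Case handling in the real
# _is_iww_clause is not shown, so this sketch stays case-sensitive.
_IWW_RE = re.compile(r"^\s*IN\s+WITNESS")


def is_iww_clause(text: str) -> bool:
    """True only when the text STARTS with the IWW phrase."""
    return bool(_IWW_RE.match(text or ""))


def is_iww_carrier(title: str, body: str) -> bool:
    """Span-or-body check: either the joined span or the body alone starts with IWW."""
    span = f"{title} {body}".strip()  # assumption: span = title + " " + body
    return is_iww_clause(span) or is_iww_clause(body)
```

With the idx=12 shape (filler title, IWW sentence in the body) the span check alone fails and the body check succeeds, while a body that only mentions the phrase mid-sentence is still rejected because `re.match` anchors at the start.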

Verified output for idx=12

  • 412 records, distribution {L0:1, L1:22, L2:336, L3:53} (max depth 3)
  • Reconstruction: word_coverage 97.1%, char_ratio 97.3%

IWW + sig area (verbatim)

o=405 L1: [Remainder of page intentionally left blank]
          IN WITNESS WHEREOF, the parties hereto have caused this Agreement to be executed by their respective officers thereunto duly authorized as of the day and year first above written.
o=406 L2: TRITON CONTAINER INTERNATIONAL LIMITED / By: / Name: / Title:
o=407 L2: BANK OF AMERICA, N.A. , as Administrative Agent / By: / Name: / Title:
o=408 L2: BANK OF AMERICA, N.A. , as a Lender and as an Issuer / By: / Name: Matthew N. Walt / Title: Vice President
o=409 L2: MUFG UNION BANK, N.A. / By: / Name: / Title:
o=410 L2: SUNTRUST BANK / By: / Name: / Title:
o=411 L2: WELLS FARGO BANK, N.A. / By: / Name: / Title:

idx=2 preserved (the OTHER "[Remainder]+IWW" agreement)

idx=2 (PR #75) has /s/ marks (orders 420, 421), so it flows through the unmodified has_slash_s tree-chain branch. The form/template up-walk branch is gated by not has_slash_s and iww_present — never reached for idx=2. Inspector verified idx=2 byte-identical to its baseline (422 records, PANDORA/KKR sig blocks at L2 unchanged).

Test plan

  • uv run scripts/parse_doc2dict_with_config.py --limit 13 --no-truncate --output-dir data/auto_parse exits 0 with ok 13
  • uv run scripts/level_loop/freeze.py 12 --force reports word_coverage ≥ 90% (97.1%)
  • uv run scripts/level_loop/regress.py reports all 13 frozen idxs OK
  • Inspector verified all 12 prior idxs byte-identical via shasum diff vs HEAD
  • Inspector independently verified the strict ^\s*IN\s+WITNESS anchor still applies to widened span-OR-body check (no mid-text false positives)
  • Inspector verified 4-hop up-walk safety guards (L0, section-marker, sig-shape-sibling)
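
The byte-identical check in the test plan can be approximated with a short digest comparison. This is a hedged sketch: the directory arguments, the `changed_baselines` name, and the choice of SHA-256 are assumptions; only the `idx_*.jsonl` file naming comes from this PR.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_baselines(baseline_dir: Path, candidate_dir: Path) -> list[str]:
    """Names of frozen files that are missing or differ byte-for-byte."""
    changed = []
    for base in sorted(baseline_dir.glob("idx_*.jsonl")):
        cand = candidate_dir / base.name
        if not cand.exists() or sha256_of(base) != sha256_of(cand):
            changed.append(base.name)
    return changed
```

An empty return value corresponds to "all prior idxs byte-identical".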

Source

http://www.sec.gov/Archives/edgar/data/1660734/000166073417000038/tritoncontainerninthrestat.htm

🤖 Generated with Claude Code


CodeAnt-AI Description

Fix signature block freezing for agreements where the witness clause is split across title and body text

What Changed

  • Signature parties are now recognized even when the “IN WITNESS WHEREOF” (IWW) sentence appears only in the body and the title contains page filler text
  • Signature lines are now found one level higher when the immediate sibling group is only a continuation fragment, so the real signers are frozen at the correct level
  • The IWW marker is left at its original level instead of being demoted with the signature lines

Impact

✅ Fewer missed signature blocks
✅ More accurate signer grouping on page-split agreements
✅ More accurate frozen parses for restated credit agreements

… Agreement: IWW carrier detected when title is page-chrome filler, sig parties demoted to L2 via ancestor up-walk

idx=12 is the Ninth Restated and Amended Credit Agreement among Triton
Container International Limited (Bermuda), various lenders, and Bank
of America (administrative agent), filed as SEC EX-10. Source URL:
www.sec.gov/Archives/edgar/data/1660734/000166073417000038/tritoncontainerninthrestat.htm

Two shape-based fixes in _explode_signature_block_lines, both keyed on
the same structural cue: doc2dict packed the page-chrome filler
"[Remainder of page intentionally left blank]" into the title of the
IWW carrier so the IWW operating sentence ended up in the body, not
at the start of the joined span.

1. PASS-2 IWW carrier detection now checks _is_iww_clause(span)
   OR _is_iww_clause(body). The body-alone check fires when the title
   is page-chrome filler ahead of the IWW sentence. The same widened
   check is applied at every IWW-skip point inside the IWW-anchored
   branch (sibling loop, descendant walk, PASS 3 demotion) so the IWW
   carrier is never inadvertently demoted to L2.

2. sig_area_parent_ids now expands via an ancestor up-walk (bounded
   to 4 hops) when the IWW carrier's immediate siblings carry no
   sig-shape records. For idx=12, the IWW lives as a child of an
   all-caps continuation node (nid=247, the "CONNECTION WITH THIS
   AGREEMENT…" body fragment of 15.16 Waiver of Jury Trial that
   doc2dict split off a page-break). The actual sig parties (TRITON
   CONTAINER, BoA, MUFG, SunTrust, Wells Fargo) are siblings of
   nid=247 under SECTION 15. GENERAL (nid=246). The up-walk stops at
   the L0 title, at section-marker ancestors, or as soon as sig-shape
   siblings are found at the current level — bounded so it cannot
   reach top-level body clauses.

idx=2 (the other [Remainder of page]/IWW agreement in the corpus)
keeps its frozen baseline because _consolidate_sig_lines_after_iww
still uses the strict _is_iww_clause(span) check, so the existing
PANDORA/KKR /s/-carrier placement is unchanged.

Stats: 412 records, levels {0: 1, 1: 22, 2: 336, 3: 53}, word
coverage 97.1%, char ratio 97.3%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai

coderabbitai Bot commented May 17, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Improved detection and handling of witness clauses in document parsing, preventing missed identifications in signature areas.
    • Enhanced signature block extraction logic with better boundary detection and parent element traversal.
  • Chores

    • Updated internal parsing state records.

Walkthrough

This PR enhances the signature-page explosion logic to robustly detect and preserve the IWW (IN WITNESS WHEREOF) operating clause, then records the completion of a freeze operation for document index 12. The core behavioral change improves how the algorithm identifies IWW text and computes signature-area parent candidates across multiple scan phases.

Changes

Signature-Page IWW Handling and Freeze

| Layer / File(s) | Summary |
|---|---|
| IWW Detection and Parent Computation Refactoring (`scripts/parse_doc2dict_with_config.py`) | IWW-carrier detection now matches IWW text in either span or body, broadening coverage when non-IWW titles mask IWW content. Parent candidate computation shifts from collecting immediate parents to bounded upward traversal per carrier with stopping conditions on L0 titles, agreement-clause ancestors, and sig-shaped siblings. PASS-2 sibling and descendant scanning, plus PASS-3 depth pinning, apply consistent span-or-body IWW detection throughout to prevent misclassification. |
| Freeze State Update for Index 12 (`data/auto_parse/level_freeze/state.json`) | Frozen index list extended to include 12, and history entry appended recording the freeze operation with 412 records at timestamp 2026-05-17T08:44:03. |
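
The bounded upward traversal summarized above can be sketched as follows. This is an illustration under stated assumptions, not the code in `scripts/parse_doc2dict_with_config.py`: the record keys (`node_id`, `parent_node_id`, `depth`) follow the snippets quoted in this PR, while `sig_area_parents`, the two predicate arguments, the order of the guards, and the decision to collect every visited level are hypothetical.

```python
from typing import Callable, Optional

Record = dict


def sig_area_parents(
    iww_carriers: list[Record],
    records: list[Record],
    is_sig_shape: Callable[[Record], bool],
    has_section_marker_title: Callable[[Record], bool],
    max_hops: int = 4,
) -> set[Optional[int]]:
    """Collect parent ids whose children may hold the signature blocks."""
    by_node_id = {r["node_id"]: r for r in records}
    children: dict[Optional[int], list[Record]] = {}
    for r in records:
        children.setdefault(r.get("parent_node_id"), []).append(r)

    found: set[Optional[int]] = set()
    for iww in iww_carriers:
        cur_pid = iww.get("parent_node_id")
        seen: set[Optional[int]] = set()  # cycle protection
        walked = 0
        while cur_pid is not None and cur_pid not in seen and walked < max_hops:
            seen.add(cur_pid)
            found.add(cur_pid)
            # Guard (c): stop once sig-shape siblings exist at this level.
            if any(is_sig_shape(c) for c in children.get(cur_pid, []) if c is not iww):
                break
            parent_rec = by_node_id.get(cur_pid)
            # Guards (a) and (b): never climb past the L0 title or a real clause.
            if (
                parent_rec is None
                or parent_rec.get("depth") == 0
                or has_section_marker_title(parent_rec)
            ):
                break
            cur_pid = parent_rec.get("parent_node_id")
            walked += 1
    return found
```

On a toy tree shaped like idx=12 (carrier 248 under 247, signers as siblings of 247 under 246), the walk visits 247, climbs once, finds sig-shape children under 246, and stops.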

Sequence Diagram

sequenceDiagram
  participant PASS2_Detect as PASS-2: IWW Detection
  participant PASS2_Parents as PASS-2: Parent Walk
  participant PASS2_Sibling as PASS-2: Sibling Skip
  participant PASS2_Desc as PASS-2: Descendant Skip
  participant PASS3_Pin as PASS-3: Depth Pin
  
  PASS2_Detect->>PASS2_Parents: iww_present, iww_carriers (span or body)
  PASS2_Parents->>PASS2_Sibling: sig_area_parent_ids (bounded walk)
  PASS2_Sibling->>PASS2_Desc: skip IWW (span or body)
  PASS2_Desc->>PASS3_Pin: skip IWW (span or body)
  PASS3_Pin->>PASS3_Pin: keep IWW at L1 (span or body)

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

  • arthrod/clause-extract#36: Updates the same state.json freeze machinery by extending the frozen list and appending history entries for a newly frozen index.
  • arthrod/clause-extract#17: Modifies the same level-freeze state structure with extended frozen index list and appended history entries for a new freeze step.
  • arthrod/clause-extract#31: Updates state.json by extending the frozen list and appending history entries for a newly frozen index, though does not include script logic changes.

Suggested Labels

Feat2

Poem

🐰 A clause hides when titles don't match,
But now we hunt in body and span both,
IWW escapes through refactored paths,
Walking upward with wisdom, not wrath,
Index twelve is frozen—our logic made whole!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The PR description comprehensively describes the changeset: adding idx=12 as a frozen baseline with two surgical parser changes to IWW detection and ancestor up-walk logic, with verified output metrics and test results. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title Check | ✅ Passed | Title check skipped as CodeRabbit has written the PR title. |


@codeant-ai codeant-ai Bot added the size:L This PR changes 100-499 lines, ignoring generated files label May 17, 2026
@coderabbitai coderabbitai Bot changed the title idx=12: freeze (412 records) — Triton Container Ninth Restated Credit Agreement (widened IWW + ancestor up-walk) idx=12: freeze (412 records) — Triton Container Ninth Restated Improve IWW detection in signature-page explosion logicedit Agreement (widened IWW + ancestor up-walk) May 17, 2026
@coderabbitai coderabbitai Bot added the Feat2 label May 17, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enhances the IWW (In Witness Whereof) detection logic by checking both the combined span and the direct body text, which prevents filler text in titles from obscuring the IWW anchor. It also introduces an up-walk mechanism to locate signature areas by searching up to four ancestor levels for signature-shaped siblings. Review feedback identifies a regression where root-level IWW carriers are skipped in the new parent ID initialization and suggests a fix. Additionally, the reviewer recommends refactoring duplicated signature detection logic into a helper function to improve maintainability.

Comment on lines +4263 to +4268
sig_area_parent_ids: set[int | None] = set()
for iww in iww_carriers:
    cur_pid = iww.get("parent_node_id")
    iww_seen: set[int | None] = set()
    walked = 0
    while cur_pid is not None and cur_pid not in iww_seen and walked < 4:

high

The initialization of sig_area_parent_ids as an empty set, combined with the while cur_pid is not None condition, introduces a regression where root-level IWW carriers (those with parent_node_id=None) no longer have their siblings checked for signature shapes. The original implementation correctly included None in the set of parent IDs to consider.

To fix this, initialize sig_area_parent_ids with the immediate parents of all IWW carriers, which restores the original behavior for root nodes, and then perform the up-walk for non-root parents.

Suggested change
-sig_area_parent_ids: set[int | None] = set()
-for iww in iww_carriers:
-    cur_pid = iww.get("parent_node_id")
-    iww_seen: set[int | None] = set()
-    walked = 0
-    while cur_pid is not None and cur_pid not in iww_seen and walked < 4:
+sig_area_parent_ids: set[int | None] = {
+    iww.get("parent_node_id") for iww in iww_carriers
+}
+for iww in iww_carriers:
+    cur_pid = iww.get("parent_node_id")
+    if cur_pid is None:
+        continue
+    iww_seen: set[int | None] = set()
+    walked = 0
+    while cur_pid is not None and cur_pid not in iww_seen and walked < 4:
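
To illustrate why seeding matters, here is a small sketch (the helper names `seed_parent_ids` and `siblings_under` are hypothetical, not functions in the parser). With an empty initial set, a carrier whose `parent_node_id` is `None` contributes nothing; seeding with the immediate parents keeps `None` as a lookup key, so root-level siblings are still scanned.

```python
from typing import Optional


def seed_parent_ids(iww_carriers: list[dict]) -> set[Optional[int]]:
    """Immediate parents of every IWW carrier, including None for root-level ones."""
    return {iww.get("parent_node_id") for iww in iww_carriers}


def siblings_under(records: list[dict], parent_id: Optional[int]) -> list[dict]:
    """All records that share the given parent id (None means root level)."""
    return [r for r in records if r.get("parent_node_id") == parent_id]
```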

Comment on lines +4299 to +4305
if (
    _SIG_FIELD_RE.match(s_title)
    or _SIG_FIELD_RE.match(s_body)
    or (s_title and _SIG_BLOCK_LABEL_RE.match(s_title))
    or (s_title and _CORP_SUFFIX_LABEL_RE.match(s_title))
    or (not s_title and s_body and _SIG_FIELD_RE.match(s_body))
):

medium

The logic for detecting signature-shaped records is now duplicated in multiple places within this function (the up-walk, the sibling loop, and the descendant loop). This increases the risk of inconsistencies if the signature detection rules need to be updated. Consider extracting this logic into a local helper function to improve maintainability.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/parse_doc2dict_with_config.py`:
- Around line 4265-4270: The up-walk loop starting from cur_pid =
iww.get("parent_node_id") skips adding a search base when parent_node_id is
None, so root-level IWW carriers never seed sibling discovery; fix by detecting
if iww.get("parent_node_id") is None before the while and explicitly add the
IWW's node id (e.g., iww.get("node_id")) or an appropriate root search key into
the up-walk seed collection so root-level sig-shape siblings are included, then
proceed with the existing while using cur_pid, iww_seen, walked and by_node_id
as before.
- Around line 4210-4212: In the /s/ PASS-3 branch currently using a span-only
IWW check, update the conditional to use the same span-or-body guard as earlier:
replace uses of _is_iww_clause(span) with (_is_iww_clause(span) or
_is_iww_clause(body)) (reusing the existing body = (r.get("body_direct") or
"").strip() value) so that the branch that sets r["depth"] for PASS-3 respects
IWW found in body_direct as well; keep the rest of the branch logic (r["depth"]
assignments and subdoc_penalty) unchanged.
📥 Commits

Reviewing files that changed from the base of the PR and between 721dd85 and 402a056.

📒 Files selected for processing (3)
  • data/auto_parse/level_freeze/frozen/idx_12.jsonl
  • data/auto_parse/level_freeze/state.json
  • scripts/parse_doc2dict_with_config.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (Custom checks)

**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run <cli> --help, assert exit code 0. Fail if smoke test fails.
Run uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run uv run ruff check . --diff for Python linting. Fail if exit code is non-zero and list each violation.
Run uv run ruff format --check --diff . for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run uv run ruff check --select I,F401 . to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: uv run pytest --tb=line -q on origin/main to capture baseline pass/fail counts, and uv run pytest --tb=short -q on PR branch. Fail immediately if exit code is non-zero.
Run uv run typy check for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare type: ignore comments (without error codes) in Python files and cast() calls without explanatory comments. Warn for each. Fail if bare type: ignore count > 3.

Files:

  • scripts/parse_doc2dict_with_config.py
**/*.{py,ts,tsx}

📄 CodeRabbit inference engine (Custom checks)

For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.

Files:

  • scripts/parse_doc2dict_with_config.py
🔍 Remote MCP

Based on my search efforts, I was unable to access the repository-specific documentation (DeepWiki does not have the arthrod/clause-extract repository indexed), and Context7 encountered rate limits. However, I was able to gather valuable context about IWW clauses from web search results.

Additional Context Gathered

Understanding IWW (IN WITNESS WHEREOF) Clauses

The "IN WITNESS WHEREOF" clause serves as the formal conclusion to a contract or legal document, indicating that the parties have agreed to the terms and are executing the agreement. Typically, this clause precedes the signature blocks, where each party signs and dates the document, thereby making it legally binding.

This context is directly relevant to PR #85's parser improvements, which focus on correctly detecting and isolating IWW clauses in document parsing. The PR's widened IWW detection (checking both span and body) addresses cases where page formatting artifacts may separate the IWW phrase from associated signature block content.

Key Challenges in IWW Detection

The web search results highlight why parser logic for IWW detection is challenging:

  • Sometimes contracts include lines like "In witness whereof, the undersigned have executed this agreement," even though there's no actual witness signature on the document, which can raise questions about authenticity or suggest the document is incomplete.
  • A growing number of lawyers and organizations have moved away from "In Witness Whereof" entirely, replacing it with straightforward English or dropping the testimonium clause altogether, with common alternatives including "Signed by the parties on [date]," "Agreed to and signed," or simply placing a date and signature block after the last contractual provision with no introductory phrase at all.

Signature Block Structure

A signature block is the section of a legal document where the parties sign their names to indicate their consent and commitment to the agreement, and it typically includes names, titles, and date lines. The parser's ancestor up-walk logic appears designed to correctly identify these signature regions within hierarchical document structures.

Assessment

The PR's parser improvements are well-targeted for real-world document structure challenges: documents where page layout places IWW and signature blocks in non-obvious hierarchical relationships, and where OCR or page-chrome artifacts (like "[Remainder of page intentionally left blank]") create parsing ambiguities.

🔇 Additional comments (2)
scripts/parse_doc2dict_with_config.py (1)

4321-4325: LGTM!

Also applies to: 4345-4347, 4370-4371

data/auto_parse/level_freeze/state.json (1)

15-16: LGTM!

Also applies to: 203-208

Comment on lines +4210 to 4212
        body = (r.get("body_direct") or "").strip()
        if _is_iww_clause(span) or _is_iww_clause(body):
            r["depth"] = 1 + (r.get("subdoc_penalty") or 0)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Apply span-or-body IWW guard in the /s/ PASS-3 branch too.

This change widens IWW detection here, but Line 4535 still uses span-only exclusion. In documents where IWW is in body_direct and title has filler text, the /s/ branch can still demote the IWW carrier to L2.

Suggested fix
-        if _is_iww_clause(_span_text(r)):
+        r_body = (r.get("body_direct") or "").strip()
+        if _is_iww_clause(_span_text(r)) or _is_iww_clause(r_body):
             continue

Comment on lines +4265 to +4270
            cur_pid = iww.get("parent_node_id")
            iww_seen: set[int | None] = set()
            walked = 0
            while cur_pid is not None and cur_pid not in iww_seen and walked < 4:
                iww_seen.add(cur_pid)
                parent_rec = by_node_id.get(cur_pid)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle root-level IWW carriers in the up-walk seed.

If an IWW carrier has parent_node_id=None, the current loop never adds a search base, so root-level sig-shape siblings are never discovered in the IWW-only (no /s/) path.

Suggested fix
         sig_area_parent_ids: set[int | None] = set()
         for iww in iww_carriers:
             cur_pid = iww.get("parent_node_id")
+            if cur_pid is None:
+                # Root-level IWW: scan root siblings as signature-area candidates.
+                sig_area_parent_ids.add(None)
+                continue
             iww_seen: set[int | None] = set()
             walked = 0
             while cur_pid is not None and cur_pid not in iww_seen and walked < 4:

@codeant-ai

codeant-ai Bot commented May 17, 2026

CodeAnt AI finished reviewing your PR.
