idx=4: freeze (77 records) — mixed-case corp-suffix party label detection#77

Open
arthrod wants to merge 1 commit into redo/idx-3 from redo/idx-4

Conversation

@arthrod
Owner

@arthrod arthrod commented May 17, 2026

User description

Summary

Fifth stacked PR. Adds idx=4 (INDEMNIFICATION AGREEMENT, ULURU Inc. + Arindam Bose — same template as idx=0, different Indemnitee) as the fifth verified frozen baseline on top of idx=3 (PR #76).

Fixes a sig-page detection bug that was leaking the Company sig block fragments as L1 mid-document records.

Parser changes (1 surgical, shape-driven)

_CORP_SUFFIX_LABEL_RE (new) — SHAPE detector for mixed-case corporate names with entity suffixes:

# matches: "ULURU Inc.", "Acme Corp", "Foo Holdings LLC", "Bar L.P.", etc.
# entity suffixes: Inc./Inc, Corp./Corp, LLC, L.P./LP, Ltd./Ltd, Limited,
#                  Co./Co, Company, N.A., S.A., GmbH, AG, PLC, LLP

No company names encoded.
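
A quick way to sanity-check the shape in isolation is sketched below. The pattern is copied from the diff quoted later in this thread; the sample labels other than "ULURU Inc." are illustrative and are not taken from the corpus.

```python
import re

# Pattern as quoted in the review diff for scripts/parse_doc2dict_with_config.py.
_CORP_SUFFIX_LABEL_RE = re.compile(
    r"^[A-Z][A-Za-z0-9 .,&'\-]{0,80}\s+"
    r"(?:Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
    r"\.?$"
)

# Illustrative inputs: four corporate-looking labels and two sig-page
# fragments that should NOT be treated as corp-suffix labels.
samples = ["ULURU Inc.", "Acme Corp", "Foo Holdings LLC", "Bar L.P.", "By", "INDEMNITEE"]

matched = [s for s in samples if _CORP_SUFFIX_LABEL_RE.match(s)]
print(matched)
```

Note that the pattern only describes a shape; a sentence fragment that happens to end in "Company" would also match, which is presumably why the parser pairs it with other checks such as an empty body.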

UP-climb in _explode_signature_block_lines — now matches EITHER strict ALL-CAPS _SIG_BLOCK_LABEL_RE (already existed) OR the new corp-suffix shape. Tracks claimed parents in sig_block_parents.

DOWN-expansion — walks from BOTH /s/ carriers AND sig-block parents, catching siblings of the carrier that doc2dict split off as separate nodes (e.g. "By" alone, separated from "/s/ Terrance K. Wallberg…").

Verified output for idx=4

  • 77 records, distribution {L0:1, L1:24, L2:46, L3:6}
  • Reconstruction: word_coverage 99.3%, char_ratio 99.4%
  • Max depth: 3

Top-level structure

o=0  L0: INDEMNIFICATION AGREEMENT
o=1  L1: THIS INDEMNIFICATION AGREEMENT (the "Agreement") is made... between ULURU Inc... and Arindam Bose...
o=2  L1: WITNESSETH THAT: / WHEREAS, ... / NOW, THEREFORE, ...
o=3-69 L1: Sections 1-21 (numbered top-level body clauses)
o=70 L1: IN WITNESS WHEREOF, the parties hereto have executed...
o=71 L2: ULURU Inc.                ← was leaking to L1 mid-document before fix
o=72 L2: By                          ← was leaking to L1 mid-document before fix
o=73 L2: :/s/ Terrance K. Wallberg... / Name: Terrance K. Wallberg / Title: Vice President & Chief Financial Officer
o=74 L2: INDEMNITEE
o=75 L2: /s/ Arindam Bose_________________________
o=76 L2: Arindam Bose / Address:

Test plan

  • uv run scripts/parse_doc2dict_with_config.py --limit 5 --no-truncate --output-dir data/auto_parse exits 0 with ok 5
  • uv run scripts/level_loop/freeze.py 4 --force reports word_coverage ≥ 90% (99.3%)
  • uv run scripts/level_loop/regress.py reports all 5 frozen idxs OK
  • Inspector verified no mid-document "ULURU Inc." or "By" L1 leak; both at L2 in sig area as required

Source

http://www.sec.gov/Archives/edgar/data/1168220/000116822017000020/ex_10-26.htm

Why this matters for the corpus

The corp-suffix detector handles a very common SEC filing pattern: party labels using entity suffixes (Inc., LLC, Corp., L.P., etc.) instead of pure ALL-CAPS. Any subsequent agreement with a mixed-case corporate party label will be correctly identified as a sig-block parent, preventing mid-document leaks.

🤖 Generated with Claude Code


CodeAnt-AI Description

Detect mixed-case company signature blocks and keep split signature fragments together

What Changed

  • Signature pages now recognize company names with mixed case and a corporate suffix, such as “ULURU Inc.”, as signature block labels
  • Split signature-page fragments like “By”, “Name:”, and “Title:” are now captured together with the related signer instead of being left behind
  • This fixes missing signature lines in company sign-off blocks and improves record reconstruction on agreements with mixed-case corporate party names

Impact

✅ Fewer missing signature lines
✅ More complete agreement parsing
✅ Cleaner company sign-off extraction


…ling-fragment sig expansion

idx=4 (ULURU Inc. INDEMNIFICATION AGREEMENT EX-10.26, second of two
Indemnification templates in the corpus) emits 77 records: 1 L0, 24 L1
(preamble, recitals, 21 numbered Sections, IWW operating clause),
46 L2 (lettered subsections + sig page lines per doc2dict natural
grouping), 6 L3 (roman items under "Change in Control"). Reconstruction
99.3% word coverage, 99.4% char ratio. All 5 frozen idxs OK.

Parser changes — purely SHAPE-based, two surgical additions to
`_explode_signature_block_lines`:

1. _CORP_SUFFIX_LABEL_RE — new shape detector for mixed-case corporate
   party labels (e.g. "ULURU Inc.", "Acme Corp.", "Foo Bar LLC"). The
   existing _SIG_BLOCK_LABEL_RE is strict ALL-CAPS and misses these.
   The new pattern is structural — uppercase-leading proper-noun prefix
   followed by a corporate entity suffix (Inc./Inc, Corp./Corp, LLC,
   L.P./LP, Ltd./Ltd, Limited, Co./Co, Company, N.A., S.A., GmbH, AG,
   PLC, LLP). No specific company names are encoded.

2. Sibling-fragment DOWN-expansion — when the UP-climb claims a parent
   as a sig-block label, that parent's node_id is tracked. The DOWN-
   expansion walks from BOTH /s/ carriers AND sig-block parents,
   catching SIBLINGS of the carrier under the same parent (e.g. a
   separate "By" fragment that doc2dict split off into its own
   predicted-header node).

Root cause for idx=4: doc2dict gave the Company sig block as
three sibling nodes under one parent — nid=64 "ULURU Inc.", nid=65
"By", nid=66 "/s/ Terrance K. Wallberg... | Name:... | Title:...".
Before the fix, the UP-climb failed to claim nid=64 because the
strict ALL-CAPS regex rejected "ULURU Inc." (mixed case), so neither
"ULURU Inc." nor its sibling "By" were marked as sig lines. They
remained L1 records mid-document between Section 17 (Notices) and
Section 18 (Counterparts). After the fix, nid=64 is claimed via the
corp-suffix shape, DOWN-expansion from nid=64 catches the sibling
"By" (nid=65), and the sig-line consolidation pass moves the whole
block to its natural position after the IWW operating clause at L2.
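
The root cause can be reproduced on a toy tree. This is a minimal sketch, not the parser's code: it assumes nid=64 is the parent of nid=65 and nid=66 (one reading of the description above), uses a hypothetical stand-in for `_SIG_BLOCK_LABEL_RE` (its real definition is not shown in this PR), and skips the `looks_sig` filtering that the real DOWN-expansion applies.

```python
import re

# Corp-suffix pattern as quoted in the review diff.
CORP_SUFFIX_RE = re.compile(
    r"^[A-Z][A-Za-z0-9 .,&'\-]{0,80}\s+"
    r"(?:Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
    r"\.?$"
)
# Hypothetical stand-in for the strict ALL-CAPS detector.
ALL_CAPS_RE = re.compile(r"^[A-Z][A-Z0-9 .,&'\-]*$")

# Toy tree: nid -> (parent nid, title). Layout is an assumption.
NODES = {
    64: (None, "ULURU Inc."),
    65: (64, "By"),
    66: (64, "/s/ Terrance K. Wallberg"),
}


def descendants(root):
    """Yield every node id below root, depth first."""
    for nid, (parent, _title) in NODES.items():
        if parent == root:
            yield nid
            yield from descendants(nid)


def sig_nodes(is_label):
    """Return node ids claimed as signature-page content."""
    carriers = {nid for nid, (_p, title) in NODES.items() if title.startswith("/s/")}
    claimed_parents = set()
    # UP-climb: claim ancestors whose title has a party-label shape.
    for nid in carriers:
        parent = NODES[nid][0]
        while parent is not None and is_label(NODES[parent][1]):
            claimed_parents.add(parent)
            parent = NODES[parent][0]
    # DOWN-expansion: walk from carriers AND claimed parents.
    found = carriers | claimed_parents
    for root in carriers | claimed_parents:
        found |= set(descendants(root))
    return found


strict_only = sig_nodes(lambda t: bool(ALL_CAPS_RE.match(t)))
with_corp = sig_nodes(lambda t: bool(ALL_CAPS_RE.match(t) or CORP_SUFFIX_RE.match(t)))
print(sorted(strict_only), sorted(with_corp))
```

With the strict matcher alone only the /s/ carrier is claimed; once the corp-suffix shape is accepted, the parent and its "By" sibling are picked up as well.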

No regressions: idx=0 (75 records), idx=1 (532 records), idx=2 (422
records), idx=3 (102 records) all still pass freeze + regress with
the same record counts as before.

coderabbitai Bot commented May 17, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Improvements

    • Enhanced parsing of signature blocks in legal documents, with improved recognition of mixed-case corporate party labels (e.g., "ULURU Inc.", "Acme Corp.") and more comprehensive capture of signature-related content fragments across document pages.
  • Data

    • Added new parsed Indemnification Agreement document containing 21 sections plus execution details.

Walkthrough

Parser improvements to recognize mixed-case corporate party labels in signature-page detection, paired with a data freeze of 77 processed Indemnification Agreement segments. The parser changes introduce a new regex pattern, tracking mechanism, and expanded ancestor/descendant logic to better capture signature-block content split across multiple DOM nodes. State and data artifacts are updated to record idx_4 as frozen.

Changes

Signature-page shape detection and Indemnification Agreement data freeze

| Layer | File(s) | Summary |
| --- | --- | --- |
| Corporate-suffix label regex and signature-block parent tracking | scripts/parse_doc2dict_with_config.py | Introduces _CORP_SUFFIX_LABEL_RE regex to match mixed-case corporate labels (Inc, Corp, LLC, Ltd, etc.) alongside the existing strict ALL-CAPS detector. Adds sig_block_parents set to track ancestor nodes identified during upward expansion, enabling broader descendant capture. |
| Upward/downward signature-block expansion logic | scripts/parse_doc2dict_with_config.py | Party-label ancestor detection now accepts either ALL-CAPS or corporate-suffix label shapes with body-empty validation. Downward descendant walk expands from both /s/ carriers and recorded signature-block parents. Sig-line looks_sig predicate now recognizes corporate-suffix labels as valid triggers. |
| Indemnification Agreement data freeze and state update | data/auto_parse/level_freeze/frozen/idx_4.jsonl, data/auto_parse/level_freeze/state.json | Freezes 77 parsed Indemnification Agreement segments (ordered with nesting levels) into idx_4.jsonl. Updates state.json to add idx 4 to frozen array and appends freeze history record with timestamp 2026-05-17T06:08:40, idx, and segment count. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • arthrod/clause-extract#50: Both PRs update the same freezing state artifacts by appending freeze entries and extending the frozen list, though targeting different idx values.
  • arthrod/clause-extract#17: Both PRs follow the same level-freeze workflow pattern by adding a new idx_*.jsonl dataset and extending state.json metadata, even though they target different indices.

Suggested labels

Feat2

Poem

🐰 A corporate label, mixed with care,
Now recognized throughout the air,
With regex wings, we catch each name—
From "ACME" strict to "Inc." the same.
And frozen indices, stacked with grace,
Find their home in this parsing place.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and specifically describes the main changes: adding idx=4 as a frozen baseline and implementing mixed-case corporate suffix party label detection in the signature block parser. |
| Description check | ✅ Passed | The description provides detailed context about adding idx=4 as a verified frozen baseline and explains the parser bug fix for mixed-case corporate party label detection in signature blocks. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@codeant-ai codeant-ai Bot added the size:L This PR changes 100-499 lines, ignoring generated files label May 17, 2026
@coderabbitai coderabbitai Bot added the Feat2 label May 17, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request improves signature block parsing by introducing the _CORP_SUFFIX_LABEL_RE regex to identify mixed-case corporate names and updating the detection logic to include siblings of signature carriers. These changes ensure that related fields like 'By' and 'Title' are correctly associated with the signature block. The reviewer suggested adding 'Corporation' to the list of corporate suffixes to further improve detection accuracy.

# encoded — only the structural suffix shape.
_CORP_SUFFIX_LABEL_RE = re.compile(
    r"^[A-Z][A-Za-z0-9 .,&'\-]{0,80}\s+"
    r"(?:Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"

medium

The suffix Corporation is missing from the list of corporate entity suffixes in _CORP_SUFFIX_LABEL_RE, although Corp and Limited are included. Adding Corporation would improve detection for mixed-case corporate names that use the full word in signature blocks.

Suggested change
-r"(?:Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
+r"(?:Inc|Corp|Corporation|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
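
The effect of the suggestion can be checked with a small comparison. This is a sketch built from the two suffix lists quoted above plus the prefix and trailing `\.?$` shown elsewhere in the thread; "Acme Corporation" is an illustrative label, not one from the corpus.

```python
import re

PREFIX = r"^[A-Z][A-Za-z0-9 .,&'\-]{0,80}\s+"
TAIL = r"\.?$"

# Suffix list currently in the PR.
CURRENT = r"(?:Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
# Suffix list with the reviewer's proposed addition.
SUGGESTED = r"(?:Inc|Corp|Corporation|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"

current_re = re.compile(PREFIX + CURRENT + TAIL)
suggested_re = re.compile(PREFIX + SUGGESTED + TAIL)

label = "Acme Corporation"  # illustrative label
before = bool(current_re.match(label))
after = bool(suggested_re.match(label))
print(before, after)
```

Listing `Corp` before `Corporation` is harmless here: the trailing `$` forces the engine to backtrack into the alternation and try the longer alternative.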


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/parse_doc2dict_with_config.py`:
- Around line 3308-3312: The corp-suffix regex (_CORP_SUFFIX_LABEL_RE) and the
existing _SIG_BLOCK_LABEL_RE are being used in multiple places (notably
_explode_signature_block_lines and _looks_like_sig_page_line) which causes
duplicated/ drifting logic; create a single helper function named
_is_sig_block_label(text: str) that encapsulates the combined matching logic
(use both _SIG_BLOCK_LABEL_RE and _CORP_SUFFIX_LABEL_RE as appropriate) and
replace direct regex checks in _explode_signature_block_lines,
_looks_like_sig_page_line, and any other sig-label checks (e.g., the occurrences
referenced around the other checks) to call _is_sig_block_label so all
label-matching logic is centralized and consistent.
- Around line 3550-3556: Add a regression fixture that exercises the
"split-company signature" shape (mixed-case corporate parent like "ULURU Inc."
followed by "/s/" and separated "By"/name fragments) so future changes to the
UP-climb and DOWN-walk keep behavior stable: create a small test input and
expected output asserting that the UP-climb claims the mixed-case parent (the
logic that populates sig_block_parents) and that the PASS 2.5 DOWN-walk starting
from those claimed parents recovers sibling fragments (e.g., "By", person name,
title) as separate nodes; place the fixture alongside the existing parser
regression tests and add assertions targeting the sig_block_parents usage and
the final parsed structure to ensure the parent is claimed and the split
siblings are produced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c2a86068-a5da-49c7-8d97-c29c3fbecadd

📥 Commits

Reviewing files that changed from the base of the PR and between 7c6de32 and 81ad76f.

📒 Files selected for processing (3)
  • data/auto_parse/level_freeze/frozen/idx_4.jsonl
  • data/auto_parse/level_freeze/state.json
  • scripts/parse_doc2dict_with_config.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (Custom checks)

**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run <cli> --help, assert exit code 0. Fail if smoke test fails.
Run uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run uv run ruff check . --diff for Python linting. Fail if exit code is non-zero and list each violation.
Run uv run ruff format --check --diff . for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run uv run ruff check --select I,F401 . to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: uv run pytest --tb=line -q on origin/main to capture baseline pass/fail counts, and uv run pytest --tb=short -q on PR branch. Fail immediately if exit code is non-zero.
Run uv run typy check for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare type: ignore comments (without error codes) in Python files and cast() calls without explanatory comments. Warn for each. Fail if bare type: ignore count > 3.

Files:

  • scripts/parse_doc2dict_with_config.py
**/*.{py,ts,tsx}

📄 CodeRabbit inference engine (Custom checks)

For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.

Files:

  • scripts/parse_doc2dict_with_config.py
🔍 Remote MCP

Based on my search results, I can now provide you with relevant context for reviewing this PR. Let me also note that I do not have access to the "Stitch" design tools mentioned in your requirements—those tools are not available in my current toolset (which includes Sentry, Context7, DeepWiki, and web search capabilities).


Relevant Context Summary

SEC EDGAR Document Structure & Signature Blocks

The PR is working with SEC EDGAR documents, which require retrieving, parsing, and extracting content and metadata from filing documents. Signature blocks in SEC filings are a critical structural element that need proper detection and extraction.

In SEC filings, securities laws require filed documents to be signed, and signatures are typically typed as text equivalents for electronic filing. Signature pages in SEC documents commonly contain:

  • Company/party names (often in mixed-case format with corporate entity suffixes like "Inc.", "Corp.", "LLC")
  • Title/position information
  • Signature placeholders (e.g., /s/ carriers)
  • Signatory names

Document Parsing Challenges

The PR addresses a real and common pattern in SEC EDGAR filings: SEC EDGAR documents need to be parsed into a tree of elements that correspond to the visual structure of the document, with semantic elements including section titles, paragraphs, and tables, forming a semantic tree that corresponds to the visual and informational structure of the document.

The bug fixed by this PR stems from doc2dict limitations: When documents are parsed by the doc2dict library, signature block fragments can become fragmented into sibling nodes. The existing ALL-CAPS regex (_SIG_BLOCK_LABEL_RE) would miss mixed-case corporate entity labels like "ULURU Inc." or "Acme Corp.", causing those fragments to leak into the mid-document L1 level instead of being correctly grouped in the signature section at L2.

Validation & Regression Context

The PR includes strong validation metrics:

  • 77 records with correct hierarchical distribution: {L0:1, L1:24, L2:46, L3:6}
  • Word coverage: 99.3% (exceeds the 90% threshold)
  • Character ratio: 99.4% (near-perfect reconstruction)
  • All 5 frozen baselines pass regression tests (idx_0 through idx_4)

Parser Enhancement Details

The solution adds two complementary detection mechanisms:

  1. New _CORP_SUFFIX_LABEL_RE: A shape-based regex that detects mixed-case corporate names followed by common entity suffixes (Inc, Corp, LLC, L.P., Ltd, Limited, Co, Company, N.A., S.A., GmbH, AG, PLC, LLP) — without encoding specific company names, making it reusable across different filing sources.

  2. Enhanced traversal logic:

    • UP-climb now tracks both ALL-CAPS and corp-suffix matches in sig_block_parents
    • DOWN-expansion now walks from both /s/ carriers AND identified sig_block_parents to capture sibling fragments that doc2dict may have split off (e.g., separated "By" nodes)

Related Work

Related PRs (#50 and #17) also update the level-freeze artifacts with new frozen baselines, but the main PR uniquely includes the parser regex and signature-shape code changes needed to handle mixed-case entity suffixes.


Note on Stitch Tools: The Stitch design service mentioned in your user requirements is not available in my current toolset. The available tools are limited to Sentry (error tracking), Context7 (library documentation), DeepWiki (GitHub repository analysis), and web search. If you need design generation for this PR review, you would need to access Stitch directly through its own interface.

🔇 Additional comments (2)
data/auto_parse/level_freeze/frozen/idx_4.jsonl (1)

1-77: LGTM!

data/auto_parse/level_freeze/state.json (1)

7-8: LGTM!

Also applies to: 129-134

Comment on lines +3308 to +3312
_CORP_SUFFIX_LABEL_RE = re.compile(
    r"^[A-Z][A-Za-z0-9 .,&'\-]{0,80}\s+"
    r"(?:Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
    r"\.?$"
)

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Centralize sig-block label matching.

The new corp-suffix shape is wired into _explode_signature_block_lines, but the module still has other sig-label checks to keep in sync. _looks_like_sig_page_line() still only knows about _SIG_BLOCK_LABEL_RE, so this logic has already started to drift. Please extract a single helper like _is_sig_block_label() and reuse it here.

♻️ Suggested consolidation
+def _is_sig_block_label(text: str) -> bool:
+    text = (text or "").strip()
+    return bool(
+        _SIG_BLOCK_LABEL_RE.match(text)
+        or _CORP_SUFFIX_LABEL_RE.match(text)
+    )
+
 def _looks_like_sig_page_line(span: str) -> bool:
@@
-    if _SIG_BLOCK_LABEL_RE.match(span):
+    if _is_sig_block_label(span):
         return True
     return False
@@
-                    _SIG_BLOCK_LABEL_RE.match(p_title)
-                    or _CORP_SUFFIX_LABEL_RE.match(p_title)
+                    _is_sig_block_label(p_title)
@@
-                or (d_title and _SIG_BLOCK_LABEL_RE.match(d_title) and not d_body)
-                or (d_title and _CORP_SUFFIX_LABEL_RE.match(d_title) and not d_body)
+                or (d_title and _is_sig_block_label(d_title) and not d_body)

As per coding guidelines, duplicate code (copy/paste, similar logic, abstractions) should be addressed.

Also applies to: 3581-3586, 3651-3653

Comment on lines +3550 to +3556
# Track which records were claimed as sig-block PARENTS during the
# UP-climb so PASS 2.5 can expand DOWN from them to catch siblings
# of the carrier under the same parent (mixed-case corporate party
# labels often parent a /s/ carrier plus separate "By"/"Name:"/
# "Title:" sibling fragments).
sig_block_parents: set[int] = set()


🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add a regression fixture for the split-company signature shape.

This fix depends on two pieces staying aligned: the UP-climb claiming the mixed-case corporate parent, and the DOWN-walk starting from that claimed parent to recover split siblings like By. A small parser regression case covering ULURU Inc. + /s/ + separated By/name fragments would make future regex or traversal tweaks much safer.

Also applies to: 3626-3635

Comment on lines +3308 to +3312
_CORP_SUFFIX_LABEL_RE = re.compile(
    r"^[A-Z][A-Za-z0-9 .,&'\-]{0,80}\s+"
    r"(?:Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
    r"\.?$"
)

Suggestion: The new corporate-suffix detector is case-sensitive, so mixed-case party names with uppercase suffixes (for example, Acme INC. or Foo LTD) will not match and their signature-block parents will be missed. That causes the intended sibling-capture fix to fail for a common formatting variant. Make the suffix match case-insensitive (or explicitly support uppercase forms) so corporate labels are detected consistently. [incorrect condition logic]

Severity Level: Major ⚠️
- ❌ Corporate labels like "Acme INC." not treated as sig parents.
- ⚠️ Sibling "By/Name/Title" lines stay at incorrect depths.
- ⚠️ Signature-page segmentation around such parties becomes inconsistent.
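
One way to address this is sketched below, under the assumption that only the suffix group should ignore case while the leading capital stays strict. It uses Python's scoped inline flag `(?i:...)`; the names `strict_re` and `relaxed_re` are hypothetical and the labels are illustrative.

```python
import re

PREFIX = r"^[A-Z][A-Za-z0-9 .,&'\-]{0,80}\s+"
SUFFIXES = r"Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP"

# Behaviour equivalent to the pattern in the PR (case-sensitive suffix).
strict_re = re.compile(PREFIX + r"(?:" + SUFFIXES + r")\.?$")
# Variant: the suffix group alone is matched case-insensitively.
relaxed_re = re.compile(PREFIX + r"(?i:" + SUFFIXES + r")\.?$")

for label in ["Acme INC.", "Foo LTD", "ULURU Inc."]:
    print(label, bool(strict_re.match(label)), bool(relaxed_re.match(label)))
```

The relaxed form also accepts fully lowercase suffixes such as "inc" or "co", so it widens the set of lines that can be claimed as labels; listing the uppercase forms explicitly would be the narrower alternative.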
Steps of Reproduction ✅
1. Run the parser CLI `scripts/parse_doc2dict_with_config.py` via `main()` (defined at
`scripts/parse_doc2dict_with_config.py:66-100`) or indirectly through
`scripts/level_loop/freeze.py` which invokes this script (see `PARSER_SRC` at
`scripts/level_loop/freeze.py:43` and the `uv run ... parse_doc2dict_with_config.py`
command at `scripts/level_loop/freeze.py:630-653`).

2. Ensure the parsed agreement contains a signature-page party label node whose title is a
mixed-case company name with an uppercase suffix, for example `Acme INC.` or `Foo LTD`,
and whose `body_direct` is empty; this becomes one of the `rows` records passed into
`_explode_signature_block_lines()` at `scripts/parse_doc2dict_with_config.py:3400-427` as
part of the `sections` pipeline in `parse_one()` (see `sections =
_explode_signature_block_lines(sections)` at
`scripts/parse_doc2dict_with_config.py:3958`).

3. During PASS 2 UP-climb in `_explode_signature_block_lines()`, the ancestor title is
checked against `_SIG_BLOCK_LABEL_RE` and `_CORP_SUFFIX_LABEL_RE` in the party-label
condition at `scripts/parse_doc2dict_with_config.py:322-333`; `_SIG_BLOCK_LABEL_RE` only
matches strict ALL-CAPS, and `_CORP_SUFFIX_LABEL_RE` (defined at
`scripts/parse_doc2dict_with_config.py:3308-3312`) is case-sensitive and only recognizes
`Inc`, `Ltd`, `Co`, etc. with the exact casing shown, so titles ending in `INC.`, `LTD`,
or `CO.` do not match either pattern and are never added to `sig_block_parents`.

4. Because the mixed-case corporate parent is not recorded in `sig_block_parents`, the
DOWN-expansion loop at `scripts/parse_doc2dict_with_config.py:372-399` only walks
descendants from `/s/` carriers (not from the parent), so sibling fragments under the same
parent—such as a separate `By` node doc2dict split off from the `/s/` line as described in
the comment at `scripts/parse_doc2dict_with_config.py:367-371`—are never visited, never
satisfy the `looks_sig` check, and thus are omitted from `sig_line_node_ids`; PASS 3 at
`scripts/parse_doc2dict_with_config.py:400-425` therefore fails to pin these sibling
signature lines to depth 2, leaving those lines at incorrect depths in the final
`sections` output.

Fix in Cursor | Fix in VSCode Claude

(Use Cmd/Ctrl + Click for best experience)

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** scripts/parse_doc2dict_with_config.py
**Line:** 3308:3312
**Comment:**
	*Incorrect Condition Logic: The new corporate-suffix detector is case-sensitive, so mixed-case party names with uppercase suffixes (for example, `Acme INC.` or `Foo LTD`) will not match and their signature-block parents will be missed. That causes the intended sibling-capture fix to fail for a common formatting variant. Make the suffix match case-insensitive (or explicitly support uppercase forms) so corporate labels are detected consistently.


Comment on lines +3632 to 3635
    walk_roots.update(sig_block_parents)
    for root_nid in walk_roots:
        for d in _walk_descendants(root_nid):
            if d.get("is_envelope") or d.get("scope") == "trailer":

Suggestion: Expanding downward from every claimed parent via full descendant traversal is broader than the stated sibling-only fix and can pull unrelated nodes into signature classification under the same parent subtree. Restrict the parent-based expansion to immediate children (or re-check chain relation to a carrier) to avoid demoting non-signature content to L2. [logic error]

Severity Level: Major ⚠️
- ❌ Non-signature descendants under sig-block parents reclassified as signature.
- ⚠️ Some substantive clauses demoted to flat L2 signature level.
- ⚠️ Reconstruction around sig blocks can include unintended extra content.
Steps of Reproduction ✅
1. Parse an agreement through `parse_one()`
(`scripts/parse_doc2dict_with_config.py:3847-3991`), either directly or via the CLI
`main()` (`scripts/parse_doc2dict_with_config.py:66-100`) as invoked in
`scripts/level_loop/freeze.py:630-653`, so that `_explode_signature_block_lines(sections)`
is applied at `scripts/parse_doc2dict_with_config.py:3958` to the `sections` list.

2. In the resulting `rows` passed to `_explode_signature_block_lines()`
(`scripts/parse_doc2dict_with_config.py:3400-427`), assume there is a corporate
party-label ancestor whose title matches `_CORP_SUFFIX_LABEL_RE` (for example `ULURU
Inc.`) and has an empty or sig-shaped body, so that it satisfies the party-label condition
at `scripts/parse_doc2dict_with_config.py:317-333` and its `node_id` is added both to
`sig_line_node_ids` and to `sig_block_parents`
(`scripts/parse_doc2dict_with_config.py:334-335`).

3. Also assume that under this same parent there exists a deeper descendant node
representing non-signature content (for example a short header like `Acknowledgment` or
another bare-name predicted header with no enumeration and empty `body_direct`), so that
it is reachable via the tree from the parent but is not conceptually part of the signature
block; during the DOWN-expansion, `walk_roots` is built from both `/s/` carriers and
`sig_block_parents` at `scripts/parse_doc2dict_with_config.py:372-373`, and
`_walk_descendants()` (`scripts/parse_doc2dict_with_config.py:353-365`) traverses the full
subtree under the parent, ensuring this non-signature descendant is yielded as `d` in the
loop at `scripts/parse_doc2dict_with_config.py:375-399`.

4. For such a descendant `d` with a non-empty title, no section marker (so
`_has_section_marker_title(d)` at `scripts/parse_doc2dict_with_config.py:196-203` returns
False), and empty body, the `looks_sig` predicate at
`scripts/parse_doc2dict_with_config.py:388-396` evaluates True via the `(d_title and not
d_body)` "bare name as title" branch, causing its `node_id` to be added to
`sig_line_node_ids` at `scripts/parse_doc2dict_with_config.py:397-398`; later, PASS 3 at
`scripts/parse_doc2dict_with_config.py:400-425` reassigns this non-signature record's
`depth` to 2 and marks `_sig_line = True`, effectively misclassifying it as a
signature-line record solely because it sits somewhere in the descendant subtree of a
sig-block parent rather than being a true sibling of a `/s/` carrier.
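
The scenario in steps 3 and 4 can be sketched with a toy tree. Everything below is hypothetical: the node table, the titles, and the helpers `walk_descendants`, `immediate_children`, and `bare_title` are illustrative stand-ins rather than the PR's `_walk_descendants()` or its `looks_sig` predicate. Only the "bare name as title" branch (`title and not body`) is modelled.

```python
from typing import Iterator

# Hypothetical node table: node_id -> (title, body, child node_ids).
NODES = {
    "party":   ("ULURU Inc.", "", ["by", "carrier"]),
    "by":      ("By", "", []),
    "carrier": ("/s/ Signer", "", ["name"]),
    "name":    ("Signer Name", "", ["ack"]),
    "ack":     ("Acknowledgment", "", []),  # not part of the sig block
}


def walk_descendants(root: str) -> Iterator[str]:
    """Depth-first walk over the full subtree below root (root excluded)."""
    stack = list(reversed(NODES[root][2]))
    while stack:
        nid = stack.pop()
        yield nid
        stack.extend(reversed(NODES[nid][2]))


def immediate_children(root: str) -> Iterator[str]:
    """Yield only the direct children of root."""
    yield from NODES[root][2]


def bare_title(nid: str) -> bool:
    """Assumed stand-in for the 'bare name as title' branch of looks_sig."""
    title, body, _children = NODES[nid]
    return bool(title and not body)


# Full subtree walk from the party-label parent reaches "ack".
full = [n for n in walk_descendants("party") if bare_title(n)]
# Child-only walk from the same parent stops at "by" and "carrier".
narrow = [n for n in immediate_children("party") if bare_title(n)]
print(full)
print(narrow)
```

In this toy tree the full walk claims `ack` as a signature line, while the child-only walk does not. Per the PR description, `/s/` carriers remain walk roots of their own, so a child-only restriction on parent-rooted walks would presumably leave nodes below a carrier unaffected; that wiring is an assumption, not something the diff confirms.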



@codeant-ai

codeant-ai Bot commented May 17, 2026

CodeAnt AI finished reviewing your PR.

@arthrod
Owner Author

arthrod commented May 17, 2026

Triage agent — PR #77 comment review (read-only pass, no code changes)

5 inline comments reviewed:

  1. gemini-code-assist @ parse_doc2dict...py:3310 — Corporation missing from _CORP_SUFFIX_LABEL_RE (WILL-DEFER)
    Valid: Corp and Limited are present but Corporation is not. Low risk to add. Deferred to polish PR (adding one suffix to a regex constant is safe but requires re-validating sig-block detection across all idxs).

  2. coderabbitai @ parse_doc2dict...py:3312 — centralize sig-block label matching (WILL-DEFER)
    Nitpick/Trivial: _looks_like_sig_page_line() and _explode_signature_block_lines use different label-checking paths. Centralization would reduce drift. No active bug. Deferred to a refactor pass.

  3. coderabbitai @ parse_doc2dict...py:3556 — add regression fixture for split-company sig shape (WILL-DEFER)
    Legitimate test coverage gap. The UP-climb + DOWN-walk alignment for split corporate names has no dedicated fixture. Deferred to a testing polish PR.

  4. codeant-ai @ parse_doc2dict...py:3312 — _CORP_SUFFIX_LABEL_RE is case-sensitive (WILL-DEFER)
    Acme INC. and Foo LTD (all-caps suffix) would not match. Adding re.IGNORECASE to the pattern would fix it. Low risk. Deferred to polish PR alongside the Corporation addition.

  5. codeant-ai @ parse_doc2dict...py:3635 — descendant traversal too broad for sibling-only fix (WILL-DEFER)
    Legitimate scope concern: the DOWN-walk can pull non-sig content under large parent subtrees. Restricting to immediate children or adding a chain-check guard would tighten it. Deferred — requires a concrete failing case to avoid regression.

WILL-DEFER items (5): Add Corporation to corp-suffix regex; add re.IGNORECASE; centralize sig-label matching; add fixture for split-company shape; restrict descendant walk scope.

Triage only — no code changes made this round.


Labels

Feat2 size:L This PR changes 100-499 lines, ignoring generated files
