idx=4: freeze (77 records) — mixed-case corp-suffix party label detection#77
arthrod wants to merge 1 commit into
Conversation
…ling-fragment sig expansion

idx=4 (ULURU Inc. INDEMNIFICATION AGREEMENT EX-10.26, second of two Indemnification templates in the corpus) emits 77 records:
- 1 L0
- 24 L1 (preamble, recitals, 21 numbered Sections, IWW operating clause)
- 46 L2 (lettered subsections + sig page lines per doc2dict natural grouping)
- 6 L3 (roman items under "Change in Control")

Reconstruction: 99.3% word coverage, 99.4% char ratio. All 5 frozen idxs OK.

Parser changes: purely SHAPE-based, two surgical additions to `_explode_signature_block_lines`:

1. `_CORP_SUFFIX_LABEL_RE`: new shape detector for mixed-case corporate party labels (e.g. "ULURU Inc.", "Acme Corp.", "Foo Bar LLC"). The existing `_SIG_BLOCK_LABEL_RE` is strict ALL-CAPS and misses these. The new pattern is structural: an uppercase-leading proper-noun prefix followed by a corporate entity suffix (Inc./Inc, Corp./Corp, LLC, L.P./LP, Ltd./Ltd, Limited, Co./Co, Company, N.A., S.A., GmbH, AG, PLC, LLP). No specific company names are encoded.
2. Sibling-fragment DOWN-expansion: when the UP-climb claims a parent as a sig-block label, that parent's node_id is tracked. The DOWN-expansion walks from BOTH /s/ carriers AND sig-block parents, catching SIBLINGS of the carrier under the same parent (e.g. a separate "By" fragment that doc2dict split off into its own predicted-header node).

Root cause for idx=4: doc2dict gave the Company sig block as three sibling nodes under one parent: nid=64 "ULURU Inc.", nid=65 "By", nid=66 "/s/ Terrance K. Wallberg... | Name:... | Title:...". Before the fix, the UP-climb failed to claim nid=64 because the strict ALL-CAPS regex rejected "ULURU Inc." (mixed case), so neither "ULURU Inc." nor its sibling "By" was marked as a sig line. They remained L1 records mid-document between Section 17 (Notices) and Section 18 (Counterparts).

After the fix, nid=64 is claimed via the corp-suffix shape, DOWN-expansion from nid=64 catches the sibling "By" (nid=65), and the sig-line consolidation pass moves the whole block to its natural position after the IWW operating clause at L2.

No regressions: idx=0 (75 records), idx=1 (532 records), idx=2 (422 records), idx=3 (102 records) all still pass freeze + regress with the same record counts as before.
📝 Walkthrough (summary by CodeRabbit)
Parser improvements to recognize mixed-case corporate party labels in signature-page detection, paired with a data freeze of 77 processed Indemnification Agreement segments. The parser changes introduce a new regex pattern, tracking mechanism, and expanded ancestor/descendant logic to better capture signature-block content split across multiple DOM nodes. State and data artifacts are updated to record idx_4 as frozen.
Changes: Signature-page shape detection and Indemnification Agreement data freeze
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Code Review
This pull request improves signature block parsing by introducing the _CORP_SUFFIX_LABEL_RE regex to identify mixed-case corporate names and updating the detection logic to include siblings of signature carriers. These changes ensure that related fields like 'By' and 'Title' are correctly associated with the signature block. The reviewer suggested adding 'Corporation' to the list of corporate suffixes to further improve detection accuracy.
    # encoded — only the structural suffix shape.
    _CORP_SUFFIX_LABEL_RE = re.compile(
        r"^[A-Z][A-Za-z0-9 .,&'\-]{0,80}\s+"
        r"(?:Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
The suffix Corporation is missing from the list of corporate entity suffixes in _CORP_SUFFIX_LABEL_RE, although Corp and Limited are included. Adding Corporation would improve detection for mixed-case corporate names that use the full word in signature blocks.
    -    r"(?:Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
    +    r"(?:Inc|Corp|Corporation|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
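A quick check of the suggested change, as a minimal sketch: it compares the suffix list from the diff with and without `Corporation`. The variable names and the sample label "Acme Widget Corporation" are made up for illustration. Listing `Corporation` after `Corp` still works because the trailing `\.?$` forces the regex engine to backtrack into the longer alternative.

```python
import re

PREFIX = r"^[A-Z][A-Za-z0-9 .,&'\-]{0,80}\s+"
OLD_SUFFIXES = r"(?:Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
NEW_SUFFIXES = (
    r"(?:Inc|Corp|Corporation|LLC|L\.?P|Ltd|Limited|Co|Company"
    r"|N\.A|S\.A|GmbH|AG|PLC|LLP)"
)

old_re = re.compile(PREFIX + OLD_SUFFIXES + r"\.?$")
new_re = re.compile(PREFIX + NEW_SUFFIXES + r"\.?$")

label = "Acme Widget Corporation"  # hypothetical party label
print(bool(old_re.match(label)))  # False
print(bool(new_re.match(label)))  # True

# Labels that matched before the change still match afterwards.
print(bool(new_re.match("Acme Corp.")))  # True
```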
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/parse_doc2dict_with_config.py`:
- Around line 3308-3312: The corp-suffix regex (_CORP_SUFFIX_LABEL_RE) and the
existing _SIG_BLOCK_LABEL_RE are being used in multiple places (notably
_explode_signature_block_lines and _looks_like_sig_page_line) which causes
duplicated, drifting logic; create a single helper function named
_is_sig_block_label(text: str) that encapsulates the combined matching logic
(use both _SIG_BLOCK_LABEL_RE and _CORP_SUFFIX_LABEL_RE as appropriate) and
replace direct regex checks in _explode_signature_block_lines,
_looks_like_sig_page_line, and any other sig-label checks (e.g., the occurrences
referenced around the other checks) to call _is_sig_block_label so all
label-matching logic is centralized and consistent.
- Around line 3550-3556: Add a regression fixture that exercises the
"split-company signature" shape (mixed-case corporate parent like "ULURU Inc."
followed by "/s/" and separated "By"/name fragments) so future changes to the
UP-climb and DOWN-walk keep behavior stable: create a small test input and
expected output asserting that the UP-climb claims the mixed-case parent (the
logic that populates sig_block_parents) and that the PASS 2.5 DOWN-walk starting
from those claimed parents recovers sibling fragments (e.g., "By", person name,
title) as separate nodes; place the fixture alongside the existing parser
regression tests and add assertions targeting the sig_block_parents usage and
the final parsed structure to ensure the parent is claimed and the split
siblings are produced.
📒 Files selected for processing (3)
- data/auto_parse/level_freeze/frozen/idx_4.jsonl
- data/auto_parse/level_freeze/state.json
- scripts/parse_doc2dict_with_config.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (Custom checks)
**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run `<cli> --help`, assert exit code 0. Fail if smoke test fails.
Run `uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q` for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run `uv run ruff check . --diff` for Python linting. Fail if exit code is non-zero and list each violation.
Run `uv run ruff format --check --diff .` for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (# noqa, # type: ignore). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run `uv run ruff check --select I,F401 .` to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: `uv run pytest --tb=line -q` on origin/main to capture baseline pass/fail counts, and `uv run pytest --tb=short -q` on PR branch. Fail immediately if exit code is non-zero.
Run `uv run typy check` for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare `type: ignore` comments (without error codes) in Python files and `cast()` calls without explanatory comments. Warn for each. Fail if bare `type: ignore` count > 3.
Files:
scripts/parse_doc2dict_with_config.py
**/*.{py,ts,tsx}
📄 CodeRabbit inference engine (Custom checks)
For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.
Files:
scripts/parse_doc2dict_with_config.py
🔍 Remote MCP
Relevant Context Summary
SEC EDGAR Document Structure & Signature Blocks
The PR is working with SEC EDGAR documents, which require retrieving, parsing, and extracting content and metadata from filing documents. Signature blocks in SEC filings are a critical structural element that need proper detection and extraction.
In SEC filings, securities laws require filed documents to be signed, and signatures are typically typed as text equivalents for electronic filing. Signature pages in SEC documents commonly contain:
- Company/party names (often in mixed-case format with corporate entity suffixes like "Inc.", "Corp.", "LLC")
- Title/position information
- Signature placeholders (e.g., `/s/` carriers)
- Signatory names
Document Parsing Challenges
The PR addresses a real and common pattern in SEC EDGAR filings: SEC EDGAR documents need to be parsed into a tree of elements that correspond to the visual structure of the document, with semantic elements including section titles, paragraphs, and tables, forming a semantic tree that corresponds to the visual and informational structure of the document.
The bug fixed by this PR stems from doc2dict limitations: When documents are parsed by the doc2dict library, signature block fragments can become fragmented into sibling nodes. The existing ALL-CAPS regex (_SIG_BLOCK_LABEL_RE) would miss mixed-case corporate entity labels like "ULURU Inc." or "Acme Corp.", causing those fragments to leak into the mid-document L1 level instead of being correctly grouped in the signature section at L2.
Validation & Regression Context
The PR includes strong validation metrics:
- 77 records with correct hierarchical distribution: {L0:1, L1:24, L2:46, L3:6}
- Word coverage: 99.3% (exceeds the 90% threshold)
- Character ratio: 99.4% (near-perfect reconstruction)
- All 5 frozen baselines pass regression tests (idx_0 through idx_4)
Parser Enhancement Details
The solution adds two complementary detection mechanisms:
- New `_CORP_SUFFIX_LABEL_RE`: A shape-based regex that detects mixed-case corporate names followed by common entity suffixes (Inc, Corp, LLC, L.P., Ltd, Limited, Co, Company, N.A., S.A., GmbH, AG, PLC, LLP), without encoding specific company names, making it reusable across different filing sources.
- Enhanced traversal logic:
  - UP-climb now tracks both ALL-CAPS and corp-suffix matches in `sig_block_parents`
  - DOWN-expansion now walks from both `/s/` carriers AND identified `sig_block_parents` to capture sibling fragments that `doc2dict` may have split off (e.g., separated "By" nodes)
Related Work
Related PRs (#50 and #17) also update the level-freeze artifacts with new frozen baselines, but the main PR uniquely includes the parser regex and signature-shape code changes needed to handle mixed-case entity suffixes.
🔇 Additional comments (2)
data/auto_parse/level_freeze/frozen/idx_4.jsonl (1)
1-77: LGTM!
data/auto_parse/level_freeze/state.json (1)
7-8: LGTM! Also applies to: 129-134
    _CORP_SUFFIX_LABEL_RE = re.compile(
        r"^[A-Z][A-Za-z0-9 .,&'\-]{0,80}\s+"
        r"(?:Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
        r"\.?$"
    )
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Centralize sig-block label matching.
The new corp-suffix shape is wired into _explode_signature_block_lines, but the module still has other sig-label checks to keep in sync. _looks_like_sig_page_line() still only knows about _SIG_BLOCK_LABEL_RE, so this logic has already started to drift. Please extract a single helper like _is_sig_block_label() and reuse it here.
♻️ Suggested consolidation
+def _is_sig_block_label(text: str) -> bool:
+ text = (text or "").strip()
+ return bool(
+ _SIG_BLOCK_LABEL_RE.match(text)
+ or _CORP_SUFFIX_LABEL_RE.match(text)
+ )
+
def _looks_like_sig_page_line(span: str) -> bool:
@@
- if _SIG_BLOCK_LABEL_RE.match(span):
+ if _is_sig_block_label(span):
return True
return False
@@
- _SIG_BLOCK_LABEL_RE.match(p_title)
- or _CORP_SUFFIX_LABEL_RE.match(p_title)
+ _is_sig_block_label(p_title)
@@
- or (d_title and _SIG_BLOCK_LABEL_RE.match(d_title) and not d_body)
- or (d_title and _CORP_SUFFIX_LABEL_RE.match(d_title) and not d_body)
+ or (d_title and _is_sig_block_label(d_title) and not d_body)
As per coding guidelines, duplicate code (copy/paste, similar logic, abstractions) should be addressed.
Also applies to: 3581-3586, 3651-3653
    # Track which records were claimed as sig-block PARENTS during the
    # UP-climb so PASS 2.5 can expand DOWN from them to catch siblings
    # of the carrier under the same parent (mixed-case corporate party
    # labels often parent a /s/ carrier plus separate "By"/"Name:"/
    # "Title:" sibling fragments).
    sig_block_parents: set[int] = set()
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Add a regression fixture for the split-company signature shape.
This fix depends on two pieces staying aligned: the UP-climb claiming the mixed-case corporate parent, and the DOWN-walk starting from that claimed parent to recover split siblings like By. A small parser regression case covering ULURU Inc. + /s/ + separated By/name fragments would make future regex or traversal tweaks much safer.
Also applies to: 3626-3635
    _CORP_SUFFIX_LABEL_RE = re.compile(
        r"^[A-Z][A-Za-z0-9 .,&'\-]{0,80}\s+"
        r"(?:Inc|Corp|LLC|L\.?P|Ltd|Limited|Co|Company|N\.A|S\.A|GmbH|AG|PLC|LLP)"
        r"\.?$"
    )
Suggestion: The new corporate-suffix detector is case-sensitive, so mixed-case party names with uppercase suffixes (for example, Acme INC. or Foo LTD) will not match and their signature-block parents will be missed. That causes the intended sibling-capture fix to fail for a common formatting variant. Make the suffix match case-insensitive (or explicitly support uppercase forms) so corporate labels are detected consistently. [incorrect condition logic]
Severity Level: Major ⚠️
- ❌ Corporate labels like "Acme INC." not treated as sig parents.
- ⚠️ Sibling "By/Name/Title" lines stay at incorrect depths.
- ⚠️ Signature-page segmentation around such parties becomes inconsistent.
Steps of Reproduction ✅
1. Run the parser CLI `scripts/parse_doc2dict_with_config.py` via `main()` (defined at
`scripts/parse_doc2dict_with_config.py:66-100`) or indirectly through
`scripts/level_loop/freeze.py` which invokes this script (see `PARSER_SRC` at
`scripts/level_loop/freeze.py:43` and the `uv run ... parse_doc2dict_with_config.py`
command at `scripts/level_loop/freeze.py:630-653`).
2. Ensure the parsed agreement contains a signature-page party label node whose title is a
mixed-case company name with an uppercase suffix, for example `Acme INC.` or `Foo LTD`,
and whose `body_direct` is empty; this becomes one of the `rows` records passed into
`_explode_signature_block_lines()` at `scripts/parse_doc2dict_with_config.py:3400-427` as
part of the `sections` pipeline in `parse_one()` (see `sections =
_explode_signature_block_lines(sections)` at
`scripts/parse_doc2dict_with_config.py:3958`).
3. During PASS 2 UP-climb in `_explode_signature_block_lines()`, the ancestor title is
checked against `_SIG_BLOCK_LABEL_RE` and `_CORP_SUFFIX_LABEL_RE` in the party-label
condition at `scripts/parse_doc2dict_with_config.py:322-333`; `_SIG_BLOCK_LABEL_RE` only
matches strict ALL-CAPS, and `_CORP_SUFFIX_LABEL_RE` (defined at
`scripts/parse_doc2dict_with_config.py:3308-3312`) is case-sensitive and only recognizes
`Inc`, `Ltd`, `Co`, etc. with the exact casing shown, so titles ending in `INC.`, `LTD`,
or `CO.` do not match either pattern and are never added to `sig_block_parents`.
4. Because the mixed-case corporate parent is not recorded in `sig_block_parents`, the
DOWN-expansion loop at `scripts/parse_doc2dict_with_config.py:372-399` only walks
descendants from `/s/` carriers (not from the parent), so sibling fragments under the same
parent—such as a separate `By` node doc2dict split off from the `/s/` line as described in
the comment at `scripts/parse_doc2dict_with_config.py:367-371`—are never visited, never
satisfy the `looks_sig` check, and thus are omitted from `sig_line_node_ids`; PASS 3 at
`scripts/parse_doc2dict_with_config.py:400-425` therefore fails to pin these sibling
signature lines to depth 2, leaving those lines at incorrect depths in the final
`sections` output.
    walk_roots.update(sig_block_parents)
    for root_nid in walk_roots:
        for d in _walk_descendants(root_nid):
            if d.get("is_envelope") or d.get("scope") == "trailer":
Suggestion: Expanding downward from every claimed parent via full descendant traversal is broader than the stated sibling-only fix and can pull unrelated nodes into signature classification under the same parent subtree. Restrict the parent-based expansion to immediate children (or re-check chain relation to a carrier) to avoid demoting non-signature content to L2. [logic error]
Severity Level: Major ⚠️
- ❌ Non-signature descendants under sig-block parents reclassified as signature.
- ⚠️ Some substantive clauses demoted to flat L2 signature level.
- ⚠️ Reconstruction around sig blocks can include unintended extra content.
Steps of Reproduction ✅
1. Parse an agreement through `parse_one()`
(`scripts/parse_doc2dict_with_config.py:3847-3991`), either directly or via the CLI
`main()` (`scripts/parse_doc2dict_with_config.py:66-100`) as invoked in
`scripts/level_loop/freeze.py:630-653`, so that `_explode_signature_block_lines(sections)`
is applied at `scripts/parse_doc2dict_with_config.py:3958` to the `sections` list.
2. In the resulting `rows` passed to `_explode_signature_block_lines()`
(`scripts/parse_doc2dict_with_config.py:3400-427`), assume there is a corporate
party-label ancestor whose title matches `_CORP_SUFFIX_LABEL_RE` (for example `ULURU
Inc.`) and has an empty or sig-shaped body, so that it satisfies the party-label condition
at `scripts/parse_doc2dict_with_config.py:317-333` and its `node_id` is added both to
`sig_line_node_ids` and to `sig_block_parents`
(`scripts/parse_doc2dict_with_config.py:334-335`).
3. Also assume that under this same parent there exists a deeper descendant node
representing non-signature content (for example a short header like `Acknowledgment` or
another bare-name predicted header with no enumeration and empty `body_direct`), so that
it is reachable via the tree from the parent but is not conceptually part of the signature
block; during the DOWN-expansion, `walk_roots` is built from both `/s/` carriers and
`sig_block_parents` at `scripts/parse_doc2dict_with_config.py:372-373`, and
`_walk_descendants()` (`scripts/parse_doc2dict_with_config.py:353-365`) traverses the full
subtree under the parent, ensuring this non-signature descendant is yielded as `d` in the
loop at `scripts/parse_doc2dict_with_config.py:375-399`.
4. For such a descendant `d` with a non-empty title, no section marker (so
`_has_section_marker_title(d)` at `scripts/parse_doc2dict_with_config.py:196-203` returns
False), and empty body, the `looks_sig` predicate at
`scripts/parse_doc2dict_with_config.py:388-396` evaluates True via the `(d_title and not
d_body)` "bare name as title" branch, causing its `node_id` to be added to
`sig_line_node_ids` at `scripts/parse_doc2dict_with_config.py:397-398`; later, PASS 3 at
`scripts/parse_doc2dict_with_config.py:400-425` reassigns this non-signature record's
`depth` to 2 and marks `_sig_line = True`, effectively misclassifying it as a
signature-line record solely because it sits somewhere in the descendant subtree of a
sig-block parent rather than being a true sibling of a `/s/` carrier.
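The restriction proposed in this comment can be sketched on a toy row list: full descendant traversal from `/s/` carriers, but only immediate children from claimed parents. All helper names and the `parent_id` field are assumptions made for illustration; the parser's own `_walk_descendants` and row schema are not shown in this thread.

```python
from collections import defaultdict


def build_children(rows):
    """Map parent_id -> list of child node_ids."""
    children = defaultdict(list)
    for row in rows:
        children[row["parent_id"]].append(row["node_id"])
    return children


def walk_descendants(children, root):
    """Yield every node below root, depth-first."""
    stack = list(children.get(root, []))
    while stack:
        nid = stack.pop()
        yield nid
        stack.extend(children.get(nid, []))


def expansion_candidates(rows, carriers, sig_block_parents):
    """Collect nodes to test for sig-line shape in the DOWN-expansion."""
    children = build_children(rows)
    candidates = set()
    for nid in carriers:
        # Carriers keep the full subtree walk.
        candidates.update(walk_descendants(children, nid))
    for nid in sig_block_parents:
        # Claimed parents contribute immediate children only.
        candidates.update(children.get(nid, []))
    return candidates


ROWS = [
    {"node_id": 64, "parent_id": None},
    {"node_id": 65, "parent_id": 64},
    {"node_id": 66, "parent_id": 64},
    {"node_id": 67, "parent_id": 65},  # hypothetical deeper, non-signature node
]
print(sorted(expansion_candidates(ROWS, carriers={66}, sig_block_parents={64})))  # [65, 66]
```

In this model node 67 is never offered to the shape check, whereas a full walk from node 64 would reach it.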
Triage agent: PR #77 comment review (read-only pass, no code changes)
5 inline comments reviewed:
WILL-DEFER items (5): Add
Triage only; no code changes made this round.
User description
Summary
Fifth stacked PR. Adds idx=4 (INDEMNIFICATION AGREEMENT, ULURU Inc. + Arindam Bose — same template as idx=0, different Indemnitee) as the fifth verified frozen baseline on top of idx=3 (PR #76).
Fixes a sig-page detection bug that was leaking the Company sig block fragments as L1 mid-document records.
Parser changes (1 surgical, shape-driven)
- `_CORP_SUFFIX_LABEL_RE` (new): SHAPE detector for mixed-case corporate names with entity suffixes. No company names encoded.
- UP-climb in `_explode_signature_block_lines`: now matches EITHER strict ALL-CAPS `_SIG_BLOCK_LABEL_RE` (already existed) OR the new corp-suffix shape. Tracks claimed parents in `sig_block_parents`.
- DOWN-expansion: walks from BOTH /s/ carriers AND sig-block parents, catching siblings of the carrier that doc2dict split off as separate nodes (e.g. "By" alone, separated from "/s/ Terrance K. Wallberg…").
Verified output for idx=4
{L0:1, L1:24, L2:46, L3:6}
Top-level structure
Test plan
- `uv run scripts/parse_doc2dict_with_config.py --limit 5 --no-truncate --output-dir data/auto_parse` exits 0 with `ok 5`
- `uv run scripts/level_loop/freeze.py 4 --force` reports word_coverage ≥ 90% (99.3%)
- `uv run scripts/level_loop/regress.py` reports all 5 frozen idxs OK

Source
http://www.sec.gov/Archives/edgar/data/1168220/000116822017000020/ex_10-26.htm
Why this matters for the corpus
The corp-suffix detector handles a very common SEC filing pattern: party labels using entity suffixes (Inc., LLC, Corp., L.P., etc.) instead of pure ALL-CAPS. Any subsequent agreement with a mixed-case corporate party label will be correctly identified as a sig-block parent, preventing mid-document leaks.
🤖 Generated with Claude Code
CodeAnt-AI Description
Detect mixed-case company signature blocks and keep split signature fragments together
What Changed
Impact
✅ Fewer missing signature lines
✅ More complete agreement parsing
✅ Cleaner company sign-off extraction