idx=7: freeze (25 records) — Ingevity/WestRock Amendment No.1, multi-line title + page-footer exhibit strip#80
arthrod wants to merge 1 commit into
…line L0 title merge + page-footer exhibit-title strip
idx=7 is an Amendment No.1 to the Crude Tall Oil and Black Liquor Soap
Skimmings Agreement between Ingevity Corp (Buyer) and WestRock (Seller,
two affiliates). Title is typeset across two visual lines:
AMENDMENT NO.1 TO CRUDE TALL OIL AND
BLACK LIQUOR SOAP SKIMMINGS AGREEMENT
Two parser additions, both purely structural (SHAPE-based, no phrase
blocklists):
1. `_merge_multiline_l0_title` (new): when the L0 has a preceding
sibling at the same parent (same parent_node_id, smaller node_id)
that is a predicted-header with empty body, an UPPERCASE title (≥60%
uppercase alphabetic chars), no structural-header / section /
recital / signature pattern, and no sentence-terminator, it is a
title-continuation line. Merge it into the L0 title (continuation
first, L0 second) joined by newline; mark the continuation record
as is_envelope=True so it drops from JSONL but stays in parquet.
Runs BEFORE `_split_l0_title_from_preamble` so the preamble split
sees the complete title. The disqualifier regex blocks ARTICLE /
SECTION / EXHIBIT / WHEREAS / numbered-marker / lettered-marker /
etc. patterns so body lines after the title aren't absorbed.
2. `_strip_page_footer_exhibit_titles` (new): doc2dict's HTML walker
promotes each repeating page-header tag ("Exhibit 10.7" stamped on
every page) into its own cls=exhibit section. The FIRST one is the
legitimate SEC envelope (is_envelope=True already). Subsequent ones
carry substantive body content from the page they head but their
title is the bare exhibit identifier — which becomes the highest-
priority probe in `_sort_records_by_source_position` and matches
the FIRST source occurrence, dragging the record to the document
start where `_drop_pre_title_position_records` would drop it as
pre-title chrome. Fix: clear the title on cls=exhibit/schedule/
appendix/annex records that are not is_envelope, not real subdocs
(per `_is_real_subdoc_title`), have a bare-identifier title, and
have substantive body (≥60 chars). The sorter falls back to the
body probe and locates the record at its true source position.
The 60-char threshold protects records like idx=5's "Schedule 1.1"
+ " \\n \\nEx. B-98" body (page chrome, not substantive) from being
stripped.
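The clearing rule in item 2 can be sketched as a small predicate plus a pass over the records. This is only an illustration under stated assumptions: the record keys (`cls`, `title`, `body_direct`, `is_envelope`) follow the snippets quoted in this PR, while `BARE_IDENTIFIER`, `is_real_subdoc_title`, and `strip_page_footer_exhibit_titles` are hypothetical stand-ins, not the script's actual code. Measuring the body after stripping whitespace is also an assumption.

```python
import re

# Hypothetical pattern for a bare identifier such as "Exhibit 10.7" or
# "Schedule 1.1"; the real script's pattern may differ.
BARE_IDENTIFIER = re.compile(
    r"^(exhibit|schedule|appendix|annex)\s+[A-Za-z0-9][A-Za-z0-9.\-]*$",
    re.IGNORECASE,
)
STRIPPABLE_CLS = {"exhibit", "schedule", "appendix", "annex"}
MIN_BODY_CHARS = 60  # threshold stated in the commit message


def is_real_subdoc_title(title: str) -> bool:
    """Placeholder for `_is_real_subdoc_title`; assumed to return False here."""
    return False


def strip_page_footer_exhibit_titles(rows: list[dict]) -> int:
    """Clear bare exhibit-style titles in place; return how many were cleared."""
    cleared = 0
    for row in rows:
        title = (row.get("title") or "").strip()
        body = (row.get("body_direct") or "").strip()  # assumption: stripped length
        if (
            (row.get("cls") or "") in STRIPPABLE_CLS
            and not row.get("is_envelope")
            and not is_real_subdoc_title(title)
            and BARE_IDENTIFIER.match(title)
            and len(body) >= MIN_BODY_CHARS
        ):
            row["title"] = ""  # the sorter would now fall back to the body probe
            cleared += 1
    return cleared


# Toy records: the SEC envelope, a page-header repeat, and idx=5-style chrome.
rows = [
    {"cls": "exhibit", "title": "Exhibit 10.7", "is_envelope": True,
     "body_direct": ""},
    {"cls": "exhibit", "title": "Exhibit 10.7", "is_envelope": False,
     "body_direct": "x" * 80},
    {"cls": "schedule", "title": "Schedule 1.1", "is_envelope": False,
     "body_direct": " \n \nEx. B-98"},
]
cleared = strip_page_footer_exhibit_titles(rows)
```

Only the middle record loses its title in this toy input: the envelope is protected by `is_envelope`, and the short chrome body falls under the 60-character threshold.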
Runs:
- `_merge_multiline_l0_title` before `_split_l0_title_from_preamble`.
- `_strip_page_footer_exhibit_titles` after `_explode_signature_block_lines`
and before `_sort_records_by_source_position`.
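To make the first pass in this ordering concrete, here is a minimal sketch of the merge step on toy records. The field names follow the description above, but `looks_like_continuation` is a simplified stand-in for the real predicate, and `merge_multiline_l0_title` is a hypothetical reimplementation that takes the L0 record explicitly, so treat it as an illustration rather than the script's code.

```python
def looks_like_continuation(title: str) -> bool:
    """Simplified stand-in: mostly-uppercase text with no sentence terminator."""
    alpha = [c for c in title if c.isalpha()]
    if len(alpha) < 2:
        return False
    upper_ratio = sum(1 for c in alpha if c.isupper()) / len(alpha)
    return upper_ratio >= 0.60 and not title.rstrip().endswith(
        (".", ":", ";", "?", "!")
    )


def merge_multiline_l0_title(rows: list[dict], l0: dict) -> None:
    """Fold preceding title-continuation siblings into the L0 title, in place."""
    siblings = sorted(
        (
            r for r in rows
            if r is not l0
            and r.get("parent_node_id") == l0.get("parent_node_id")
            and r["node_id"] < l0["node_id"]
        ),
        key=lambda r: r["node_id"],
        reverse=True,  # walk backwards, starting next to the L0
    )
    continuation = []
    for sib in siblings:
        title = (sib.get("title") or "").strip()
        if (
            sib.get("cls") != "predicted header"
            or (sib.get("body_direct") or "").strip()
            or not looks_like_continuation(title)
        ):
            break  # stop at the first sibling that is not a continuation line
        continuation.append(sib)
    if not continuation:
        return
    continuation.reverse()  # back to source order
    titles = [(c.get("title") or "").strip() for c in continuation]
    l0["title"] = "\n".join([*titles, (l0.get("title") or "").strip()])
    for c in continuation:
        c["is_envelope"] = True  # dropped from JSONL, kept in parquet


rows = [
    {"node_id": 2, "parent_node_id": 0, "cls": "predicted header",
     "title": "AMENDMENT NO.1 TO CRUDE TALL OIL AND", "body_direct": ""},
    {"node_id": 3, "parent_node_id": 0, "cls": "predicted header",
     "title": "BLACK LIQUOR SOAP SKIMMINGS AGREEMENT", "body_direct": "..."},
]
merge_multiline_l0_title(rows, l0=rows[1])
```

Because the merge runs before the preamble split, any later pass that inspects the L0 title sees both lines joined by a newline.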
idx=7 final stats:
- 25 records (1 L0, 9 L1, 12 L2, 3 L3)
- reconstruction: word_coverage 97.0%, char_ratio 97.7% (≥ 90% bar)
- L0 = "AMENDMENT NO.1 TO CRUDE TALL OIL AND\\nBLACK LIQUOR SOAP SKIMMINGS AGREEMENT"
Regression: idx=0..7 all OK (75, 532, 422, 102, 77, 18, 69, 25 records).
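The per-level counts reported above can be re-derived from a frozen JSONL file with a few lines of Python. The sketch below assumes each record carries a `level` field, as in the sample record quoted in the review of `idx_7.jsonl`; `tally_levels` is a hypothetical helper, and the inline two-record sample is a toy stand-in for the real file.

```python
import json
from collections import Counter


def tally_levels(jsonl_text: str) -> Counter:
    """Count records per level, ignoring blank lines such as a trailing newline."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate empty lines instead of failing on them
        counts[json.loads(line)["level"]] += 1
    return counts


# Toy sample, not the real file; the first record is illustrative only.
sample = (
    '{"idx": 7, "order": 0, "level": 0, "span": "AMENDMENT NO.1"}\n'
    '{"idx": 7, "order": 24, "level": 2, "span": "WESTROCK MWV, LLC"}\n'
    "\n"
)
counts = tally_levels(sample)
```

Run against the real frozen file, the same tally would be expected to match the 1 / 9 / 12 / 3 split listed above if the freeze is intact.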
📝 Walkthrough (summary by CodeRabbit)
The PR enhances the document parsing pipeline with two new post-processing heuristics: one merges multi-line agreement titles by detecting uppercase-dominant continuations, and another strips spurious page-header exhibit identifiers before positional sorting.
Code Review
This pull request introduces logic to handle multi-line agreement titles and spurious page-header exhibit titles in the document parsing pipeline. Specifically, it adds functions to detect and merge multi-line titles into a single L0 record and to strip bare exhibit identifiers from page footers to ensure correct sorting and retention of substantive body content. Additionally, it updates the frozen state and adds a new test record for document index 7. I have no feedback to provide as there were no review comments.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
data/auto_parse/level_freeze/frozen/idx_7.jsonl (1)
26-26: 🧹 Nitpick | 🔵 Trivial | 💤 Low value
Remove trailing empty line.
JSONL files should end after the last record without an additional empty line. The trailing newline serves no purpose and may cause parsing issues in tools that count lines.
♻️ Proposed fix
  {"idx": 7, "order": 24, "level": 2, "span": "WESTROCK MWV, LLC"}
-
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@data/auto_parse/level_freeze/frozen/idx_7.jsonl` at line 26, the file has an extra trailing empty line after the final JSON record; remove that blank line so the file ends immediately after the last JSON object (no additional empty newline), ensuring the final JSONL record is the last line in the file.
📒 Files selected for processing (3)
data/auto_parse/level_freeze/frozen/idx_7.jsonl
data/auto_parse/level_freeze/state.json
scripts/parse_doc2dict_with_config.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (Custom checks)
**/*.py: Run smoke test for Python web servers: start server, wait for ready signal, hit GET /health or GET /, assert HTTP 200. Timeout: 30s. Fail if smoke test fails.
Run smoke test for Python CLI: run `<cli> --help`, assert exit code 0. Fail if smoke test fails.
Run `uv run pytest --cov=<src_package> --cov-report=term-missing --cov-branch --cov-fail-under=80 -q` for Python coverage. Fail if exit code is non-zero, branch coverage < 70%, or line coverage < 80%.
Run `uv run ruff check . --diff` for Python linting. Fail if exit code is non-zero and list each violation.
Run `uv run ruff format --check --diff .` for Python formatting. Fail if exit code is non-zero and list each unformatted file.
Scan diffs for newly added Python suppression comments (`# noqa`, `# type: ignore`). Each suppression must have an inline justification comment. Fail if any new suppression lacks justification. Warn if total new suppressions > 3 in a single PR.
Run `uv run ruff check --select I,F401 .` to verify Python import ordering and detect unused imports. Fail if violations found.
Run the full Python test suite: `uv run pytest --tb=line -q` on origin/main to capture baseline pass/fail counts, and `uv run pytest --tb=short -q` on the PR branch. Fail immediately if exit code is non-zero.
Run `uv run typy check` for Python type checking if .py files exist in diff or project has py.typed marker. Fail if exit code is non-zero. If typy is not available, use configured mypy or pyright instead. Fail with 'No Python type checker configured' if none is found.
Scan diffs for new bare `type: ignore` comments (without error codes) in Python files and `cast()` calls without explanatory comments. Warn for each. Fail if bare `type: ignore` count > 3.
Files:
scripts/parse_doc2dict_with_config.py
**/*.{py,ts,tsx}
📄 CodeRabbit inference engine (Custom checks)
For each changed production file, verify at least one corresponding test file exists or already exists in the repo with assertions covering changed symbols. Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods. Exempt: config files, type declaration files (.d.ts), migrations, and generated code.
Files:
scripts/parse_doc2dict_with_config.py
🪛 Ruff (0.15.12)
scripts/parse_doc2dict_with_config.py
[warning] 2343-2343: Too many return statements (8 > 6)
(PLR0911)
[warning] 2377-2379: Return the condition not stripped.endswith(('.', ':', ';', '?', '!')) directly
Replace with return not stripped.endswith(('.', ':', ';', '?', '!'))
(SIM103)
[warning] 2382-2382: Too many branches (13 > 12)
(PLR0912)
[warning] 2475-2475: Consider [*continuation_titles, existing_l0_title] instead of concatenation
Replace with [*continuation_titles, existing_l0_title]
(RUF005)
🔍 Remote MCP
PR Review Context Summary
Code Changes Analysis
File: scripts/parse_doc2dict_with_config.py [275 new lines added]
The PR introduces two complementary post-processing heuristics to the EX-10 document parsing pipeline:
- `_merge_multiline_l0_title()`: addresses multi-line agreement titles
  - Problem solved: Agreement titles spanning multiple lines (typical in legal documents) were being treated as separate records
  - Solution: Detects preceding sibling with UPPERCASE-dominant text marked as predicted-header
  - Key details:
    - Runs BEFORE `_split_l0_title_from_preamble()` to process the full merged title
    - Uses disqualifier regexes to avoid incorrectly merging section headers (ARTICLE, SECTION, EXHIBIT, WHEREAS patterns)
    - Marks merged continuation records as `is_envelope=True` to suppress duplicate JSONL output while preserving parquet records
  - Validation: Tested against idx=7 document; title successfully merged
- `_strip_page_footer_exhibit_titles()`: removes spurious page headers
  - Problem solved: Page-footer/header exhibit identifiers (e.g., "EXHIBIT 10.7") repeating across pages were being treated as high-priority title probes, causing incorrect document repositioning
  - Solution: Clears bare-identifier titles on exhibits/schedules/appendices/annexes
  - Key constraints:
    - Only targets non-envelope, non-subdoc records (preserves legitimate structural headers)
    - Requires substantive body content (≥60 characters) to avoid removing legitimate small artifacts
    - Runs BEFORE the source-position sorter so the system falls back to body content analysis
  - Validation: Preserves idx=5 artifacts; removes spurious "Exhibit 10.7" records from idx=7
Data Changes
Files: data/auto_parse/level_freeze/ [33 lines total, +8/-1 in state.json]
- idx_7.jsonl: 25 records representing the AMENDMENT document (Ingevity/WestRock, March 2017)
- Composition: L0:1 (root), L1:9, L2:12, L3:3 (max depth 3)
- Quality metrics: 97.0% word_coverage, 97.7% char_ratio (indicating accurate reconstruction)
- state.json: Updated freeze history with new event tracking idx=7's 25-record baseline
Related PR Overlaps
The related PRs (#12, #15, #18, #20) all modify the same parse_doc2dict_with_config.py file with overlapping post-processing logic:
- PR #12: Title-fragment merging (similar multiline title handling)
- PR #15: Pipeline heuristics insertion around envelope/title handling
- PR #18: Title merging + SEC page-header exhibit stripping (direct overlap)
- PR #20: Envelope handling before L0/title merging
Observation: These PRs appear to be iterative refinements to the same parsing pipeline stage. Code review should verify:
- Merge order consistency across related PRs to avoid conflicts
- That heuristics don't duplicate or contradict each other
- Whether these changes should be consolidated
Testing & Validation
Per PR description, validation completed:
- ✅ Parsing tests passed
- ✅ Freeze baseline tests passed
- ✅ Regression tests passed (idx=0..7)
- ✅ Inspector verification: multiline title merge confirmed, spurious "Exhibit 10.7" removed, idx=5 artifacts preserved
- ✅ Source document URL provided
🔇 Additional comments (2)
data/auto_parse/level_freeze/state.json (1)
10-11: LGTM! Also applies to: 169-174
scripts/parse_doc2dict_with_config.py (1)
2317-2484: The coding guideline exempts private helper functions from strict test coverage requirements.
The new functions `_merge_multiline_l0_title` and `_strip_page_footer_exhibit_titles` are private (prefixed with `_`), not exported public APIs. The guideline states: "Fail if a changed production file has zero associated test file and introduces new exported functions, classes, or public methods"; private functions are explicitly outside this requirement. Additionally, `scripts/parse_doc2dict_with_config.py` is a utility script file (not part of the main `clause_extract` package), and the functions are called from `parse_one` (lines 4350, 4409), providing indirect integration coverage.
> Likely an incorrect or invalid review comment.
def _looks_like_title_continuation(title: str) -> bool:
    """Return True if `title` looks like an upper-line of a multi-line
    agreement title (not a section heading, not a body fragment).

    A title-continuation line:
    - Has alphabetic content (at least one word of 2+ letters).
    - Is predominantly uppercase letters (≥ 60% of alphabetic chars
      are uppercase); agreement titles are typeset in ALL CAPS.
    - Doesn't match any structural-header / section / recital /
      signature pattern.
    - Doesn't end with a sentence-terminator (`.`, `:`, `;`, `?`, `!`);
      agreement titles usually break mid-phrase across lines, often
      ending with conjunctions ("AND", "OF") or nouns.
    """
    if not title:
        return False
    stripped = title.strip()
    if not stripped:
        return False
    # Must not match any structural / section / recital / signature pattern.
    if _TITLE_CONTINUATION_DISQUALIFIERS.match(stripped):
        return False
    # Must have at least one alphabetic word of 2+ letters.
    if not re.search(r"[A-Za-z]{2,}", stripped):
        return False
    # Compute uppercase-ratio of alphabetic chars (>= 60% uppercase).
    alpha = [c for c in stripped if c.isalpha()]
    if not alpha:
        return False
    upper_ratio = sum(1 for c in alpha if c.isupper()) / len(alpha)
    if upper_ratio < 0.60:
        return False
    # Must not end with a sentence-terminating punctuation. Title lines
    # are noun-phrases that wrap visually; bodies end with periods.
    if stripped.endswith(('.', ':', ';', '?', '!')):
        return False
    return True
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Consider simplifying the final return statement.
The function has a straightforward boolean condition that can be returned directly.
♻️ Proposed simplification
-    if stripped.endswith(('.', ':', ';', '?', '!')):
-        return False
-    return True
+    return not stripped.endswith(('.', ':', ';', '?', '!'))
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def _looks_like_title_continuation(title: str) -> bool:
    """Return True if `title` looks like an upper-line of a multi-line
    agreement title (not a section heading, not a body fragment).
    A title-continuation line:
    - Has alphabetic content (at least one word of 2+ letters).
    - Is predominantly uppercase letters (≥ 60% of alphabetic chars
      are uppercase); agreement titles are typeset in ALL CAPS.
    - Doesn't match any structural-header / section / recital /
      signature pattern.
    - Doesn't end with a sentence-terminator (`.`, `:`, `;`, `?`, `!`);
      agreement titles usually break mid-phrase across lines, often
      ending with conjunctions ("AND", "OF") or nouns.
    """
    if not title:
        return False
    stripped = title.strip()
    if not stripped:
        return False
    # Must not match any structural / section / recital / signature pattern.
    if _TITLE_CONTINUATION_DISQUALIFIERS.match(stripped):
        return False
    # Must have at least one alphabetic word of 2+ letters.
    if not re.search(r"[A-Za-z]{2,}", stripped):
        return False
    # Compute uppercase-ratio of alphabetic chars (>= 60% uppercase).
    alpha = [c for c in stripped if c.isalpha()]
    if not alpha:
        return False
    upper_ratio = sum(1 for c in alpha if c.isupper()) / len(alpha)
    if upper_ratio < 0.60:
        return False
    # Must not end with a sentence-terminating punctuation. Title lines
    # are noun-phrases that wrap visually; bodies end with periods.
    return not stripped.endswith(('.', ':', ';', '?', '!'))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/parse_doc2dict_with_config.py` around lines 2343 - 2380, The function
_looks_like_title_continuation contains multiple early-return checks that can be
collapsed into a single boolean expression; replace the sequence of if-blocks
with one return combining: non-empty stripped, not matching
_TITLE_CONTINUATION_DISQUALIFIERS, re.search(r"[A-Za-z]{2,}", stripped),
existence of alphabetic chars and uppercase ratio >= 0.60 (compute upper_ratio
from alphabetic chars), and not stripped.endswith(('.', ':', ';', '?', '!')) so
the function returns that combined boolean directly while preserving the same
helper names (stripped, _TITLE_CONTINUATION_DISQUALIFIERS, alpha/upper_ratio
logic).
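One way to read this prompt is the single-expression form sketched below. It is only a sketch: `_TITLE_CONTINUATION_DISQUALIFIERS` is not shown in the excerpt, so a small hypothetical regex stands in for it here, and the names drop the leading underscore to avoid implying this is the committed code.

```python
import re

# Hypothetical stand-in; the real regex in the script covers more patterns.
TITLE_CONTINUATION_DISQUALIFIERS = re.compile(
    r"^(ARTICLE|SECTION|EXHIBIT|SCHEDULE|APPENDIX|ANNEX|WHEREAS|RECITALS)\b",
    re.IGNORECASE,
)
SENTENCE_TERMINATORS = (".", ":", ";", "?", "!")


def looks_like_title_continuation(title: str) -> bool:
    """Single-expression variant of the early-return checks."""
    stripped = (title or "").strip()
    alpha = [c for c in stripped if c.isalpha()]
    return bool(
        stripped
        and not TITLE_CONTINUATION_DISQUALIFIERS.match(stripped)
        and re.search(r"[A-Za-z]{2,}", stripped)
        and alpha  # guards the division on the next line
        and sum(1 for c in alpha if c.isupper()) / len(alpha) >= 0.60
        and not stripped.endswith(SENTENCE_TERMINATORS)
    )
```

The trade-off is mostly about readability: the collapsed form would likely silence the PLR0911 warning, but it loses the per-check comments, and the order of the `and` operands matters because the ratio is only safe to compute once `alpha` is known to be non-empty.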
continuation.reverse()  # source order (oldest first)
continuation_titles = [(c.get("title") or "").strip() for c in continuation]
existing_l0_title = (l0.get("title") or "").strip()
combined = "\n".join(continuation_titles + [existing_l0_title])
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Consider using unpacking instead of concatenation.
Python unpacking is more idiomatic and slightly more efficient for list construction.
♻️ Proposed refactor
- combined = "\n".join(continuation_titles + [existing_l0_title])
+ combined = "\n".join([*continuation_titles, existing_l0_title])
📝 Committable suggestion
combined = "\n".join([*continuation_titles, existing_l0_title])
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/parse_doc2dict_with_config.py` at line 2475, The line building
combined uses list concatenation; replace it with unpacking to be more idiomatic
and efficient by constructing the join input as "\n".join([*continuation_titles,
existing_l0_title]) — update the expression where combined is assigned (variable
combined, and the identifiers continuation_titles and existing_l0_title)
accordingly.
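A quick check that the two spellings build the same string. The sample values are the two lines of the idx=7 title, and the variable names mirror the snippet under review.

```python
continuation_titles = ["AMENDMENT NO.1 TO CRUDE TALL OIL AND"]
existing_l0_title = "BLACK LIQUOR SOAP SKIMMINGS AGREEMENT"

# Concatenation builds a temporary one-element list, then a new combined list.
by_concat = "\n".join(continuation_titles + [existing_l0_title])
# Unpacking builds the combined list in one step.
by_unpack = "\n".join([*continuation_titles, existing_l0_title])

assert by_concat == by_unpack
print(by_unpack)
```

One practical difference beyond style: unpacking accepts any iterable on the left, whereas `+` raises `TypeError` if `continuation_titles` is, say, a tuple rather than a list.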
if (sib.get("cls") or "") != "predicted header":
    break
if (sib.get("body_direct") or "").strip():
    break
title = (sib.get("title") or "").strip()
if not _looks_like_title_continuation(title):
    break
continuation.append(sib)
Suggestion: The title-continuation heuristic is broad enough to classify uppercase cover metadata (for example, registrant/company lines immediately before the agreement title) as a continuation line, so those rows can be merged into the L0 title and then hidden as envelope records. This corrupts the agreement title and prevents the normal pre-title cover cleanup from handling those records correctly. Tighten the continuation predicate to explicitly exclude company/party-name style lines (or require stronger title-shape signals than uppercase ratio alone). [incorrect condition logic]
Severity Level: Major ⚠️
- ❌ Agreement titles include registrant/company names for affected docs.
- ⚠️ Downstream consumers mis-index or group agreements by title.
Steps of Reproduction ✅
1. In a Python shell, import the helper from `scripts/parse_doc2dict_with_config.py` lines
103–204: `from scripts.parse_doc2dict_with_config import _merge_multiline_l0_title`.
2. Construct a `rows` list with at least two section dicts sharing the same
`parent_node_id`: (a) an L0 agreement title record `l0` with `depth=0`,
`scope="agreement"`, non-empty `title`, and `node_id=20`; (b) a preceding cover-metadata
sibling with `node_id=19`, `cls="predicted header"`, `scope="agreement"`,
`is_envelope=False`, empty `body_direct`, and an all-caps registrant-style title like
`"INGEVITY CORPORATION"`.
3. Call `_merge_multiline_l0_title(rows)` (implementation at lines 138–204). In the
backward sibling walk at lines 169–186, the cover line passes all guards: it is not
envelope/trailer (lines 174–177), has `cls=="predicted header"` (line 179), has empty
`body_direct` (line 181), and `_looks_like_title_continuation()` at lines 64–100 returns
True because the title is all caps, not matched by `_TITLE_CONTINUATION_DISQUALIFIERS`
(lines 41–60), and has no sentence-ending punctuation.
4. Observe that the function appends this cover line into the L0 title (lines 191–197) and
marks the sibling `is_envelope=True` (lines 199–202). Downstream,
`_drop_pre_title_cover_records()` at lines 3230–3256 skips envelope records and can no
longer treat this row as cover metadata, so the final JSONL agreement title incorrectly
contains the registrant/company line.Fix in Cursor | Fix in VSCode Claude
Path: scripts/parse_doc2dict_with_config.py
Line: 2458:2465
CodeAnt AI finished reviewing your PR.
Triage agent: PR #80 comment review (read-only pass, no code changes)
3 inline comments reviewed:
WILL-DEFER items (1): Title-continuation heuristic: add guard to exclude registrant/company lines from L0 merge.
Triage only; no code changes made this round.
User description
Summary
Eighth stacked PR. Adds idx=7 (AMENDMENT NO.1 TO CRUDE TALL OIL AND BLACK LIQUOR SOAP SKIMMINGS AGREEMENT between Ingevity Corporation and WestRock Shared Services / WestRock MWV, March 2017) as the eighth verified frozen baseline on top of idx=6 (PR #79).
Parser changes (2 surgical, shape-driven)
- `_merge_multiline_l0_title` (~line 2317): handles multi-line agreement titles. When the L0 record has a preceding sibling (same parent_node_id, smaller node_id) that's an UPPERCASE predicted-header continuation line, merges them. Disqualifier regex blocks ARTICLE/SECTION/EXHIBIT/SCHEDULE/APPENDIX/ANNEX/WITNESSETH/WHEREAS/RECITALS/NOW THEREFORE/IN WITNESS WHEREOF/numbered/lettered patterns. Inspector noted doc2dict already pre-merged idx=7's title at node_id=3; the new pass mainly suppresses the duplicate first-line sibling for this idx. The more general pattern handles future cases where doc2dict fails to merge.
- `_strip_page_footer_exhibit_titles` (~line 3258): handles doc2dict promoting repeating page-header tags ("Exhibit 10.7" stamped on every page) into spurious `cls=exhibit` records. Without the fix, those bare-identifier titles match the FIRST source occurrence (the cover-page header) via the sort probe, dragging records to document start where they get dropped as pre-title chrome. Fix: clears the title on `cls in {exhibit, schedule, appendix, annex}` records that are NOT `is_envelope`, NOT real subdocs (per `_is_real_subdoc_title`), have a bare-identifier title, AND have substantive body (≥60 chars). The 60-char threshold protects idx=5's tiny "Schedule 1.1" + "Ex. B-98" trailing artifact.
Verified output for idx=7
{L0:1, L1:9, L2:12, L3:3}
Top-level structure (verbatim L0 + sig area)
Test plan
- `uv run scripts/parse_doc2dict_with_config.py --limit 8 --no-truncate --output-dir data/auto_parse` exits 0 with `ok 8`
- `uv run scripts/level_loop/freeze.py 7 --force` reports word_coverage ≥ 90% (97.0%)
- `uv run scripts/level_loop/regress.py` reports all 8 frozen idxs OK
Source
http://www.sec.gov/Archives/edgar/data/1653477/000165347717000051/ex107amendmentno1toctoagre.htm
Why this matters for the corpus
🤖 Generated with Claude Code
CodeAnt-AI Description
Freeze idx=7: preserve a multi-line agreement title and keep exhibit pages in the right place
What Changed
Impact
✅ Correct agreement titles
✅ Fewer lost exhibit sections
✅ More complete document extractions